Australia’s Second Largest Telco Went Dark, And Chaos Reigned

Engineers tend to worry about uptime, whether it’s at a corporate server farm or just our own little hobby servers at home. Every now and then, something will go wrong and take a box offline, which requires a little human intervention to fix. Ideally, you’ll still have a command link that stays up so you can fix the problem. Lose that, though, and you’re in a whole heap of trouble.

That’s precisely what happened to Australia’s second largest telecommunications provider earlier this month. Systems went down, millions lost connectivity, and company techs were left scrambling to put the pieces back together. Let’s dive in and explore what happened on Optus’s most embarrassing day in recent memory.

Where to Go?

It all went down in the wee small hours of November 8, around 4:05 AM, when a routine software upgrade was scheduled. As part of the upgrade, the Border Gateway Protocol (BGP) routing information that Optus’s network receives from an international peering network changed. According to the company’s analysis after the event, “These routing information changes propagated through multiple layers in our network and exceeded preset safety levels on key routers which could not handle these. This resulted in those routers disconnecting from the Optus IP Core network to protect themselves.”

That’s all a bit of a mess, so what does it mean? Well, fundamentally, the BGP routing information tells Optus’s routers where to find other machines on the internet. The routing information updates came from a Singtel internet exchange, STiX, which Optus uses to access the global internet. What happened is that the updates overwhelmed Optus’s own routers, which shut down in response to reaching a certain default threshold level of route updates. These limits are pre-configured into the router equipment from the factory. Because the affected routers sat on Optus’s core network, they took the telco’s entire national network down with them when they went offline, affecting voice, mobile, and internet customers.
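To make the failure mode concrete, here is a minimal Python sketch of a maximum-prefix style guard on a single peering session. It is a toy model rather than how Cisco's firmware or Optus's configuration actually works: the `BgpSession` class, the `flood` helper, the limit of 100, and the `warning_only` switch are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class BgpSession:
    """Toy model of one BGP peering session with a maximum-prefix guard."""
    peer: str
    max_prefixes: int            # hypothetical limit, not a real vendor default
    warning_only: bool = False   # if True, record a warning instead of dropping
    established: bool = True
    prefixes: set = field(default_factory=set)
    warnings: list = field(default_factory=list)

    def receive(self, prefix: str) -> bool:
        """Accept one advertised prefix; return False once the session is down."""
        if not self.established:
            return False
        self.prefixes.add(prefix)
        if len(self.prefixes) > self.max_prefixes:
            if self.warning_only:
                self.warnings.append(
                    f"{self.peer}: limit exceeded at {len(self.prefixes)} prefixes"
                )
            else:
                # Self-protection: drop the session and every route learned on it.
                self.established = False
                self.prefixes.clear()
                return False
        return True


def flood(session: BgpSession, count: int) -> int:
    """Advertise `count` distinct /24 prefixes; return how many were accepted."""
    accepted = 0
    for i in range(count):
        if session.receive(f"10.{i // 256}.{i % 256}.0/24"):
            accepted += 1
    return accepted


if __name__ == "__main__":
    strict = BgpSession("upstream", max_prefixes=100)
    print(flood(strict, 150), strict.established, len(strict.prefixes))

    lenient = BgpSession("upstream", max_prefixes=100, warning_only=True)
    print(flood(lenient, 150), lenient.established, len(lenient.warnings))
```

In the strict case, the session drops as soon as the limit is crossed and every route learned over it disappears, which is roughly the behaviour Optus described, multiplied across its core routers. The warning-only variant keeps the session up and merely records the breach, trading protection of the router for continuity of service.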

Engineers spent the first six hours investigating various causes of the incident, while millions were waking up to dead internet connections and phones without signal. Crews rolled back recent changes by Optus itself and looked into whether they were under some kind of DDoS attack. In the end, the engineers determined that the routers had isolated themselves to avoid being overwhelmed by the routing information updates that had propagated through the network. Resetting routing back to normal was enough to get networks back online, with engineers carefully reintroducing traffic to Optus’s backbone to avoid any unwelcome surprises during the process. Optus eventually put the blame on the automatic safety mechanisms, stating “It is now understood that the outage occurred due to approximately 90 PE routers automatically self-isolating in order to protect themselves from an overload of IP routing information. These self-protection limits are default settings provided by the relevant global equipment vendor (Cisco).” The statement perhaps implies that the self-protection limits are unduly cautious and took the network offline when it was not really necessary to do so.

According to Optus, 150 engineers were directly involved in investigating the problem and restoring service, with another 250 staff and 5 vendors working in support. Meanwhile, the efforts to get back online were frustrated by the fact that, with Optus’s network down, it was difficult for technicians to actually access machines on the network to fix the problem. Ultimately, it would take a full fourteen hours for Optus to get its systems fully back online, with technicians having to attend some equipment in person to restore it.

Optus isn’t the only company to have had issues with a major BGP meltdown. Facebook famously disappeared from the internet in 2021 for a few hours when it got the settings wrong on a few of its own backbone routers.

The Aftermath

The result of this unprecedented outage was Optus temporarily becoming public enemy #1 in the Australian media. Millions across the nation had spent the day with no internet connection, no mobile connectivity, and few if any updates from Optus about what was actually going on. Customers had to get their updates via conventional media like newspapers, radio, and television, as they had no way to access the internet or receive calls via their own devices. Thankfully, cellular users were at least able to contact emergency services via alternate cellular networks, but landline users were cut off.

Businesses relying on EFTPOS payment terminals with Optus SIM cards were unable to take payments, while banks, hospitals, and even some train services were affected. The Melbourne train network underwent a one-hour shutdown as drivers could not communicate with the control centre, with hundreds of trains cancelled throughout the day. As for Optus itself, it shed $2 billion in value on the stock market as the day wore on, with CEO Kelly Bayer Rosmarin resigning less than two weeks later in the wake of the outage. Thus far, the company has offered customers 200 GB of free data as restitution for the outage. It’s proven cold comfort for many, particularly those in small businesses who lost out on hundreds or thousands of dollars in trade during the period.

The only winners in the scenario were Optus’s main competitors, namely, Telstra and Vodafone. The two companies run competing cellular networks as well as offering home internet connections across the nation. With this disaster occurring only a year after a major data breach at Optus saw customers compromised en masse, the two companies will be seeing dollar signs when it comes to stealing their rival’s customer base.

Ultimately, there’s a lesson to be learned from Optus’s downfall. Crucial systems should be able to handle a routine update without collapsing en masse, even if something goes wrong. In 2023, customers simply won’t accept losing connectivity for 14 hours, especially if it’s due to some poorly configured equipment. Connectivity is now almost as important to people as the air they breathe and the water they drink. Take that away and they get very upset, very quickly indeed.
Source and Read More: https://hackaday.com/2023/11/22/australias-second-largest-telco-went-dark-and-chaos-reigned/
