IT & Systems Management

United Airlines chaos: Why did it happen?

On July 8th 2015, a router problem that “degraded network connectivity for various applications” forced United Airlines to cancel and delay hundreds of flights, causing widespread disruption. How could one network device have caused so much trouble? We speak with Tom Griffin, Director of Systems Engineering, EMEA, at infrastructure performance monitoring firm SevOne to find out.


United Airlines blamed a faulty network router for the grounding of flights. How could one router have been responsible for so much chaos?

The reality is that under the covers, the infrastructures that support [the network] are incredibly complex. And they are getting more complex rather than less so. Obviously as things become more complex you’ve got multiple different technologies that make up the entire stack and it becomes really difficult to identify potential points of failure.

I think in the airline industry this is compounded by two factors. One is that the amount of online activity the industry is seeing is increasing rapidly. Even three or four years ago, I would typically have gone to a check-in desk to check in. Now, it is very rare that I have any interaction with check-in staff at the airport. I check in online and print my own boarding pass. The other, and the interesting thing in this case, is that they have gone through a fairly large merger. Effectively they have doubled the size of their organisation. Typically you have two IT organisations coming together, which may be using different technologies and, invariably, different teams. So that again adds to the complexity.

The other most likely cause of these issues, in a lot of cases, is that they have made some kind of change. That could be a software upgrade or a configuration change. Something has changed in the environment to trigger this kind of degradation. The word they used, “degradation”, was interesting, especially for us at SevOne, as we monitor performance. It wasn’t that something failed. It was that it slowed down to a point where it was impacting their ability to deliver service.
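The distinction between a failure and a degradation can be made concrete with a small sketch. It is illustrative only: the function name, the probe format and the threshold factor are assumptions for this example, not anything SevOne or United has described.

```python
from statistics import median


def classify_link(latencies_ms, baseline_ms, degraded_factor=3.0):
    """Classify a link as 'down', 'degraded' or 'healthy'.

    latencies_ms: recent probe round-trip times in milliseconds;
                  None marks a probe that got no reply.
    baseline_ms:  the latency considered normal for this link (assumed known).
    degraded_factor: hypothetical threshold; a median latency above
                  degraded_factor * baseline_ms counts as degraded.
    """
    answered = [x for x in latencies_ms if x is not None]
    if not answered:
        # No probe came back at all: this is an outright failure.
        return "down"
    if median(answered) > degraded_factor * baseline_ms:
        # The link still answers, but far slower than normal.
        return "degraded"
    return "healthy"
```

A simple up/down check would report the second case as healthy, which is why a slowdown can go unnoticed until it starts to affect service.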


Obviously with airlines, time is of the essence. When something like this happens, how long does it take for them to identify the root cause of the issue?

I was watching it as it unfolded on CNBC and, given the amount of disruption they had, it seemed to take them about four to five hours to identify the root cause and get everything back online. In an ideal world, you would identify these things in near real time. The reason it takes so long to identify these failures is typically two-fold. One is that the resolution teams that get these things back up and running don’t have the information they need at their fingertips.

The second issue is that you tend to have multiple technology teams looking at different aspects of the same fault. I don’t know the specifics of this case, but in a lot of modern IT infrastructures significant parts of the infrastructure can be outsourced. So how do you have all the data you need at your fingertips, and pull data from multiple sources so that your teams are all looking at the same information?


Didn’t United have backup systems in place?

The interesting thing about this case is that [it was not because of] a part of the system that failed and reserve equipment that failed to take over. I think what happened here is they had some kind of performance degradation. So it’s the classic case where I lose the primary link but actually the backup link isn’t of sufficient capacity to take over. Think of it in the same sense as transport networks: yesterday in London everyone knew the M25 was going to be absolute chaos because the Tube was on strike. Because people aren’t using public transport, you get more traffic on the motorways. Similarly, in IT networks, if you lose power in one capacity then that traffic has to go somewhere else. And if everything is running close to capacity, then it’s unable to cope with that.
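The backup-capacity point can be shown with a short arithmetic sketch. The link names, the figures and the assumption that displaced traffic can be spread freely across the surviving links are all hypothetical simplifications.

```python
def survives_failover(links, failed):
    """Return True if the remaining links can absorb the failed link's load.

    links:  dict mapping link name -> (capacity_mbps, load_mbps).
    failed: name of the link that is lost.
    Assumes displaced traffic can be split freely across the survivors.
    """
    displaced = links[failed][1]
    # Spare capacity is whatever headroom each surviving link still has.
    spare = sum(
        max(capacity - load, 0)
        for name, (capacity, load) in links.items()
        if name != failed
    )
    return displaced <= spare


# Hypothetical figures: a busy primary link and a smaller backup.
links = {"primary": (1000, 700), "backup": (500, 100)}
print(survives_failover(links, "primary"))  # False: 700 Mbps displaced, 400 spare
print(survives_failover(links, "backup"))   # True: 100 Mbps displaced, 300 spare
```

With these numbers the backup link is working, yet losing the primary still leaves 300 Mbps of traffic with nowhere to go.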


So even the backup systems don’t have enough capacity?

They should, but they don’t always. One of the challenges you’ve got is that the amount of online activity is growing. And the airline industry is not exactly cash-rich. Most of the airlines are struggling. So there’s always a trade-off: I don’t want to put excess capacity in place because it costs money, but how do I get the size correct? A lot of this comes down to monitoring the capacity on an ongoing basis, which is what we do.


Do you think a lot of airlines rush to implement technology before it’s ready?

I don’t think they necessarily rush in. As an industry they tend to be pretty risk-averse. 


Does this sort of thing happen quite often with airlines maybe on a smaller scale?

I wouldn’t single out airlines. If you look at any connected infrastructure you get glitches and problems. Whether it’s in banking or in a telecoms network, the trend that they all have in common is they are getting larger. There’s also an element of complexity because what we are trying to do is make these networks and systems more intelligent so they can react to change. 

You never get to a point where you eliminate problems completely. You always get human error or software bugs or hardware failure – these issues are going to happen. The main thing is having systems in place to identify in real time when those issues have happened and then take action to remedy them as quickly as possible.


What do you think about the speculation that this was the result of a cyber-attack?

The statement from the airline has been that it wasn’t – it was just a single error that caused degradation. I think there’s an ongoing concern for sure about the whole area of cyber-attacks, and I think what this does demonstrate is the dependence that organisations have on network systems. The fact that they had a network outage and had to ground flights had a really big business impact. So I think, increasingly, organisations are becoming more concerned about the cyber-security angle because they realise that their businesses are critically dependent on these systems, which need to be protected accordingly. But in this instance it seems that [a cyber-attack] wasn’t the cause.


Do you think network technology is outdated in the airline industry?

I think it’s evolving. The network and system technologies are constantly changing and different companies tolerate different levels of change. I would say, in the airline industry in general, some of the systems they are using date back to the 1960s and 1970s – they are still using mainframe-based systems. [But they have] put on a modern front-end in terms of their customer interaction, which is increasingly driven by the web, so they have made huge strides on that side in the last 10 years. So it’s not all outdated, but they do use a lot of legacy systems.


Now they have got the network back online. What kind of steps would they have taken to do that?

Typically they would have gone through the usual process of identifying what the cause is and then, depending on what the actual cause was, taken the offending system offline and replaced it.

Then the process they are probably going through now, which arguably is as important as the initial resolution, is a post mortem to ask: what actually happened here? What went wrong? And how do we prevent it happening again? In reality, we all know that these things can happen again, so when this particular event recurs, how do we identify it faster and resolve it before it is service-impacting? The assumption you always have to make is that it is going to go wrong, so when that happens, how do I identify and resolve it as quickly as possible? I suspect that’s what they are doing.


Do you think there will be a lot of finger pointing going on in the post mortem?

[Laughs] Typically you try to avoid doing that. It varies from organisation to organisation but you kind of have to have a post mortem where at least at that point there is no blame. You get the facts on the table and deal with them. Because otherwise you end up with outsourced services having to be very protective. That generally does not lead to the truth being exposed. But it depends on the company and culture to a certain extent.


Ayesha Salim

Ayesha Salim is Staff Writer at IDG Connect
