Metro Reasons: Failed Metro networking equipment blinds the control center for three hours

Metro's Rail Operations Control Center. Screenshot from WMATA video.

What happens when the Metro Rail Operations Control Center (ROCC) loses the ability to control the rail system? The agency experienced at least three such events in 2017, and one of them stretched on for nearly three and a half hours.

When everything is running smoothly, the ROCC is based out of a Metro building in Landover, Maryland. The railroad was built for automation: equipment all across the system feeds data back to Metro’s network, and that data eventually finds its way to the ROCC. There it feeds an automated information management system which plots where trains are and allows Rail Traffic Controllers to run the railroad, similar to how Air Traffic Controllers space and separate planes.

The ROCC is the “brain” of the Metrorail network. Originally located in a cramped room inside Metro’s headquarters downtown, it was (and still is) meant to be the central location where all activities on the railroad are monitored.

When the Automatic Train Operation system was active and trains drove themselves between stations, controllers took more of a hands-off approach, stepping in only when necessary. With operators on board and trains that could move themselves, they didn’t have much to do as long as everything was operating smoothly.

But when the ROCC loses the ability to do its job, as it did on June 16, 2017, management can declare a Code 34. In Metro parlance, a Code 34 is when “the ROCC is unable to remotely control and monitor the system for more than a few minutes.” ROCC monitoring includes train locations, switch and interlocking signals, fan operations, third rail power, and more.

During a Code 34, there’s little that controllers can see or do to monitor the railroad. They can’t see train locations, they can’t control fans or power, and they can’t remotely control the system’s switches if the need arises. It’s as if Metro were a 19th-century railroad, tracking train locations on paper and relying on personnel “on the ground” to control everything.

June 16, 2017 — the network goes down

At 7:09 pm on June 16, 2017, the events that led to a Code 34 began. ROCC personnel noticed that equipment throughout the railroad known as RTUs, which feed in data from track equipment, were offline. Supervisors were dispatched to all major junctions where tracks merge or split, in case the switches there needed to be operated manually on site. The Code 34 was officially declared at 7:29 pm.

A Metro conference call was started 15 minutes later at 7:44 pm to keep those affected in the loop on updates for fixes to the ROCC outage.

A Metro incident report says that approximately 45 minutes later, while IT department personnel were troubleshooting, they discovered that “the firewalls were not able to manage network traffic.” That is, the devices designed to route computer traffic in a network were failing at the primary thing they’re meant to do. The firewalls were rebooted two minutes later at 8:38 pm; they were back up, and the ROCC was starting to run normally, at 8:57 pm.

Image from a Metro Request For Information (RFI) showing a high-level overview of the two ROCC locations’ network interconnects and firewalls to the greater WMATA network. Image from Metro.

The ROCC lost visibility to the rail network at least two other times in 2017 — on May 4 and November 28 — although neither of those lasted nearly as long (around five minutes and one hour, respectively).

A software defect is the cause of the outage

A “known software defect” was the cause of the outage in June, according to a Metro incident report. The software defect “causes network traffic to stop passing through the devices,” essentially rendering them useless.

Earlier Metro documents indicate that the agency buys network and firewall equipment from Cisco, one of the major networking vendors. While it cannot be confirmed, a bug described in a Cisco service advisory appears to be a likely culprit for Metro’s networking issue.

In March 2017, the company issued an “Urgent Proactive Customer Notification” warning about a bug in some of its firewall equipment. The issue would “cause the device to stop passing traffic after 213 days of uptime.”

To resolve the issue, Cisco told customers to “proactively reboot their Cisco ASA or FTD devices that are running affected versions.” They indicated that there should be fixes available before the next time customers would have to reboot the firewalls again.
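The workaround amounts to tracking how long each firewall has been up and rebooting it on a schedule before it reaches the 213-day mark. A minimal sketch of that check follows. Only the 213-day figure comes from the notification quoted above; the device names, boot times, and 30-day safety margin are hypothetical, and nothing here reflects how Metro or Cisco actually track uptime.

```python
from datetime import datetime, timedelta

# The 213-day figure is the one quoted from the vendor notification.
UPTIME_LIMIT = timedelta(days=213)


def time_until_limit(booted_at, now, limit=UPTIME_LIMIT):
    """Return how much uptime remains before the device reaches the limit."""
    return limit - (now - booted_at)


def needs_proactive_reboot(booted_at, now, margin=timedelta(days=30)):
    """True when the device is within `margin` of the limit, or already past it.

    The 30-day margin is an assumed planning window, not a vendor value.
    """
    return time_until_limit(booted_at, now) <= margin


# Hypothetical inventory: names and boot times are invented for illustration.
firewalls = {
    "fw-a": datetime(2020, 11, 20, 8, 0),
    "fw-b": datetime(2021, 5, 1, 9, 0),
}
now = datetime(2021, 6, 1, 12, 0)

# Collect the devices that should be scheduled for a maintenance reboot.
due = sorted(
    name for name, booted in firewalls.items()
    if needs_proactive_reboot(booted, now)
)
print(due)
```

In this made-up inventory, fw-a has been up for a little over 193 days and falls inside the margin, while fw-b was rebooted a month earlier and does not.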

It is unclear why Metro was affected by a known issue that the vendor had already acknowledged and for which it had provided a workaround.

Centralized, but distributed, Metro’s ATP system keeps working

One interesting thing about a Code 34 is that it has no effect on the safety of train spacing within the rail system. While all railroad information is fed back to the ROCC, the ROCC itself is not part of the critical path in which those decisions are made. A string of Train Control Rooms (TCRs) spread out along the railroad includes all the equipment needed to know where trains are and to ensure they don’t get too close.

While controllers can’t see where trains are when the ROCC is disconnected from the network, the ATC system (specifically Automatic Train Protection, which keeps trains spaced safely) keeps on going, because it is distributed among the TCRs. A centralized system like Communications-Based Train Control might be affected under the same circumstances.

Metro Reasons is a regular breaking news, investigative reporting, and analysis column by Stephen Repetski about everything Metro. Please send tips to Metro Reasons.
