Backup power equipment failures 'brought GNB down' - Action News

Backup power equipment failures 'brought GNB down'

Premier Brian Gallant says he will "get to the bottom" of what happened in June when government servers experienced two hard crashes in one morning after backup power equipment failed at the Marysville Data Centre. A CBC News investigation has uncovered new details about the unplanned outage after obtaining internal records through the Right to Information Act.

Government silent on cost of IT infrastructure failure repairs, recovery

The Marysville Data Centre is located in Marysville Place in Fredericton. (Courtesy mynewbrunswick.ca)

Premier Brian Gallant says he will "get to the bottom" of what happened in June when government servers experienced two hard crashes in one morning after backup power equipment failed at the Marysville Data Centre (MDC).

The MDC houses IT systems used by the departments of Justice, Health, Public Safety, Social Development, Finance and more.

The repercussions were far-reaching, took days to repair and even prompted the involvement of then-premier David Alward.

"Our government takes the responsibility of protecting New Brunswickers' privacy and personal data seriously," Gallant said in a statement on Tuesday, in response to a CBC News investigation about the unplanned power outage.

The government will "take steps to ensure it doesn't happen again," Gallant said.

An external review initiated by the former government is already underway, he said. "It is our intention to release findings and recommendations in as timely a way as possible."

CBC News uncovered new details about the unplanned outage after obtaining internal records through the Right to Information and Protection of Privacy Act.

The events began on the morning of June 9, when an osprey was building a nest on an NB Power transmission line.

The bird, one of hundreds that do the same thing across the province's power grid every spring, somehow shorted the power during its build. Customers around the greater Fredericton area lost power, including the Marysville Data Centre.

The first warning came at 9:27 a.m.: "We're running on battery power for now," wrote Trevor McDonald with the Internal Services Agency (NBISA).

Data centre outage impacts

  • Some computer services go down for the justice system, leaving court "decisions in limbo because of documents they can't access . . . [and] people are possibly sitting in holding [cells] that could be, well, not in holding." A court decision involving some shale gas protesters is delayed because the judge can't access computer files.
  • Major systems in the health and motor vehicle departments go down. Concern is expressed that health-care providers in the health department's mobile crisis unit for mental health and addictions will not have computer access to a client's history in critical cases during one of the planned outages for repairs to the data centre.
  • Service New Brunswick turns away people wanting to renew their driver's licences. Its website is also offline. Automobile dealers were unable to complete online paperwork on the Service New Brunswick system.
  • Some civil servants were unable to work for hours, or even days. Overtime was needed in some cases to clear backlogs.
  • NB Liquor was unable to process credit card or debit card transactions at its stores, leaving its retail operation "basically dead in the water" for hours.

"If power doesn't return in the next 15-20 minutes, there will be a hard shutdown of systems."

There was a problem. An Automatic Transfer Switch (ATS), which is supposed to connect the data centre to a diesel generator when street power is lost, didn't work.

While technicians rushed to do the switch manually, the data centre was running on a giant battery pack called an Uninterruptible Power Supply (UPS). But the batteries ran out before the backup diesel generator could power the data centre.

"The government of NB mainframe was gracefully shut down before the UPS batteries hit their threshold to support the load, all other systems went down hard," wrote a Bell Aliant operations manager at the data centre.

The switch to diesel power was made, but two hours later the data centre hard-crashed again while technicians were trying to repair the ATS. This time the mainframe, too, went down hard.

'Yes. No. F--k it is bad'

Once backup power was restored, IT staff worked all day and into the early morning hours of the next day to restore critical systems and corporate services and to bring back public-facing websites.

Internal Services employees working on the restore characterized the situation with colourful language in their correspondence. "Yes. No. F--k it is bad," responded one employee when asked if there was an ETA on recovery.

Another wrote, "Yeah whole data centre basically puking. Power issues," when someone inquired about their lost connection to various programs.

Among the public, too, commentary was flying. Melanie Morris tweeted, "My one day off this week I go to renew my license and Service New Brunswick's systems are down. Hopefully I won't get pulled over!"

She told CBC News that Service New Brunswick employees politely told her they couldn't renew her licence because all of their systems were down.

"To get there and find out nothing is working, turn back around, come out, I was a little frustrated."

She added she returned the following Saturday to renew her licence.

Melanie Morris was turned away at Service New Brunswick on June 9 when she tried to renew her driver's licence. (CBC)

The inconvenience was probably felt more strongly by people whose legal proceedings were delayed because of Department of Justice IT issues.

"There were a lot of things that ended up being adjourned," says criminal defence lawyer Alison Menard.

On June 19, Menard's clients, Germain Junior Breau of Upper Rexton, N.B., and Aaron Francis of Eskasoni, N.S., were in custody after having previously pleaded guilty to some charges related to the violent fracking protests on Oct. 17, 2013.

The pair was awaiting a decision on other charges they had refuted in relation to the violence, but that decision was delayed because Justice Leslie Jackson couldn't access critical files in the case.

"It definitely leads to frustration for people who are incarcerated," said Menard.

Criminal defence lawyer Alison Menard had clients whose court cases were delayed because of IT problems in the justice system caused by the data centre power issues. (CBC)

"If you haven't already been convicted, it's difficult to understand, but the time you spend in jail prior to conviction is a very difficult time for people incarcerated that way."

On June 11, a technician was assigned to help specifically with issues hindering the work of the justice system. One email remarks:

"Yeah, and it gets worse. They have decisions in limbo because of documents they can't access so people are possibly sitting in holding that could be, well, not in holding. I wonder if they thought out whose network drives should have been restored first."

The hard crashes caused a ripple effect in some departments. Data corruption in storage systems caused problems that took days to recover from in a number of departments.

A June 10 email from Christian Couturier, the province's chief information officer, states: "The bigger issues are with data potentially corrupted on the storage array. This is translating into major systems being down (in health and motor vehicle). The shared services agency is working to resolve (identify and restore data). Rough day here."

A June 17 email from Angie Milbury, an NBISA director, states: "The purpose of this message is to explain to IT directors the approach we are taking to ensure all GNB File Server data affected by the outage is recovered."

"In order to effectively deal with the large number of restores resulting from the outage we are finalizing a plan to do a restore of file server data from the June 6th backup," it continued.

'You're playing with fire'

On the morning of June 9, NB Power restored power within an hour, but in the course of the first crash, the UPS was fried. An email with the subject line "fuse" and an image attached says, "This is what brought GNB down."

Records show it was soon learned that more than fuses were fried in the 25-year-old unit. Without the UPS, the MDC could not be switched back to street power. It continued to operate on diesel power for two weeks.

Emails state the rate of fuel consumption ranged from 50 to 85 litres an hour.

Further complicating matters on the morning of June 9, after the effort it took to connect diesel power to the MDC, the generator started having problems maintaining frequency, prompting a scramble to find a portable generator to replace the usual redundancy.

It is not clear from the records exactly what caused the backup power system to fail. There was regular maintenance and testing performed on the system. In fact there was maintenance to the UPS the day before the outage.

Stéphane Bertini, president of Montreal-based IT company Zonesa, said the issues with backup power for the Marysville Data Centre "could be a question of money." (CBC)

"To have equipment that is 25 years old, that raises a flag right away," said Stéphane Bertini, president of Montreal-based IT company Zonesa, who has worked in and operated data centres for over 20 years.

"... if it's only one generator [as a back up], then you're not redundant. So basically you're playing with fire," said Bertini.


"Nothing is 100 per cent fool-proof, but there are ways to maximize uptime. And it's basic ways. If you know what you're doing it is pretty easy to guarantee uptime. But again, it's money too. If you don't have the money to have twice the equipment then you're stuck. This could be a question of money."

Bertini isn't alone in wondering what was behind the failure.

'Significant financial and productivity impact'

Ten days after the crash, then-premier David Alward wrote to the province's top bureaucrat, Marc Léger, the chief of the executive council and secretary to cabinet.

Alward asked Léger to head up a committee to examine the technology failure and to figure out how to avoid anything similar in the future.

"As you are aware, the failure at Marysville Place that started June 9, 2014, had a significant financial and productivity impact on the Government of New Brunswick. I am aware that it negatively impacted services to citizens and businesses, and disrupted offices across the network," stated Alward.

Were you affected by the loss of provincial government computer service after the osprey-related power outage? Contact us at cbcnb@cbc.ca to tell us your story.

In addition to the clerk's committee, Auditor General Kim McPherson is also examining the circumstances surrounding the outage to determine whether it should be a subject in her next report.

CBC News also learned that Ernst & Young was commissioned to study the outage and was due to deliver its draft report to the clerk's committee Monday, but government would neither confirm nor deny that was the case.

CBC News requested both research interviews and on-camera interviews, and sent dozens of questions to seven different government departments. The questions sought to understand in better detail what the real impacts of the outage were, how long they lasted and how they were resolved, as well as how much the outage cost taxpayers for replacement parts, overtime and productivity downtime.

None of the questions were answered and all of the interviews were refused.

In a statement, Government Services spokesperson Sarah Ketcheson wrote, "At this time, an incident review of the June 9 outage is being undertaken. This is a standard procedure when an event occurs which has an impact on our IT system. Once the review is complete and government has had the appropriate amount of time to study the final report, we will be in a better position to respond to your questions."

'Smoke and stuff'

On June 20, two days after the premier's letter and almost two weeks after the initial outage, the diesel generator powering the MDC failed.

An 8:18 a.m. email from an NBISA employee states "most services are back up again after the portable generator went down hard this morning (smoke and stuff apparently)."

Two days later, a temporary UPS, housed in a tractor-trailer on rent from a Pennsylvania company for about $52,000 a month, was connected and allowed the MDC to be reconnected to street power.
