Surviving a Ransomware Attack Part 2: Lessons from the Enterprise Applications Group

This is Part 2 of our ransomware series. Part 1 covered the war room experience and lessons about teamwork under pressure. This post focuses on what we learned about legacy application recovery. The hard way.

Seth wrote about the war room. The 16-hour days. The team alignment that made recovery possible. All of that is true. But there's another story that needs telling: what happens when you try to restore enterprise applications that were never designed to be restored.

I led the Enterprise Applications recovery effort. What I learned will stay with me for the rest of my career.

Day Zero: What to Shut Down, What to Preserve

When the attack hit, we had seconds to make decisions that would take weeks to undo. The instinct is to isolate everything. Pull the plug. Stop the spread. That instinct is correct. But it comes with consequences nobody thinks about until they're living them.

Here's what we learned:

Shut down immediately: Anything that's actively being encrypted. Anything that could spread laterally. Anything connected to critical data stores. Don't hesitate. The data you lose by shutting down fast is nothing compared to what you lose by waiting.

Preserve if possible: Logs. Connection states. Running process information. Memory dumps if you can get them. The forensics team will need this. The recovery team will need this. You won't have time to think about it later, so think about it now. Before you're in crisis.

Document immediately: What was running. What state it was in. What connections were active. Write it down. Take screenshots. You will not remember, and you will need this information desperately in about 72 hours.

Secure your documentation: Get every piece of documentation you can find into a central, secure location fast. Ours was in Confluence, which went offline with everything else. If your docs are on the same infrastructure you're trying to recover, you don't have docs.

Audit what's missing: Once you've gathered what documentation exists, identify the gaps. You'll need answers to these questions fast:

  • Batch cycle documentation: legacy apps often use flat files for integrations. What runs when?
  • Batch cycle position: where EXACTLY was each cycle when we went down?
  • Certificate and key inventory: what certs and keys are needed to bring up new app servers? Do you even have them?
  • Data loss assessment: how much data have we lost, app by app?
  • Operational Readiness Test (ORT): a plan to prove you're ready to go back into production. This single document outlines the people, processes, and technology that all have to be ready, and how to objectively prove that readiness. Getting back into production is a moonshot. You must get it right to prevent further catastrophic impact to reputation, brand, and revenue. The ORT becomes the core of your "get it turned back on" plan.
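The audit above can be tracked as a simple checklist, one row per application, so nothing falls through the cracks while people are exhausted. A minimal sketch; the application names and artifact keys are hypothetical examples, not from the actual recovery:

```python
# Minimal recovery-gap audit: flag missing artifacts per application.
# Application names and artifact keys below are hypothetical examples.
REQUIRED = ["batch_cycle_docs", "batch_position", "certs_and_keys", "data_loss_estimate"]

apps = {
    "payroll": {"batch_cycle_docs": True,  "batch_position": False,
                "certs_and_keys": False,   "data_loss_estimate": True},
    "billing": {"batch_cycle_docs": False, "batch_position": False,
                "certs_and_keys": True,    "data_loss_estimate": False},
}

def audit_gaps(apps):
    """Return {app: [missing artifacts]} for anything not yet accounted for."""
    return {name: [a for a in REQUIRED if not state.get(a)]
            for name, state in apps.items()}

for name, missing in audit_gaps(apps).items():
    if missing:
        print(f"{name}: missing {', '.join(missing)}")
```

Even a table this crude is valuable in a war room: it turns "we don't know what we don't know" into a finite list someone can own.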

The Backup Myth

Everyone thinks they have backups. Very few organizations have recoverable backups.

Here's the reality of enterprise application backups that nobody talks about: there is no such thing as a synchronized backup across an entire application stack. Your database might do point-in-time recovery. Your file systems have snapshots. Your application servers have images. But they're all on different schedules, different retention policies, different recovery mechanisms.

Now multiply that by every application in your portfolio.

A typical financial application stack might include:

  • A database with transaction logs and PITR capability
  • Flat file interfaces running batch processes on their own schedule
  • Integration points with HR, legal, and ERP, each with its own backup window
  • Legacy components that haven't been touched in years
  • Third-party connectors with their own data stores

Unless you can magically restore every component to the exact same moment in time (and you can't), you will incur data loss. The question isn't whether you'll lose data. The question is how much and where.
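The arithmetic behind that claim is simple: a mutually consistent restore is bounded by the component with the oldest usable backup, so your data loss window is the gap between that backup and the moment you shut down. A sketch with hypothetical timestamps:

```python
from datetime import datetime

# Last usable backup per component (hypothetical timestamps).
last_backup = {
    "database (PITR)":       datetime(2024, 3, 14, 23, 50),
    "flat-file interfaces":  datetime(2024, 3, 14, 2, 0),   # nightly batch
    "app server images":     datetime(2024, 3, 10, 1, 0),   # weekly image
    "third-party connector": datetime(2024, 3, 13, 4, 0),
}

shutdown = datetime(2024, 3, 15, 0, 5)

# A mutually consistent restore point can't be newer than the oldest backup.
restore_point = min(last_backup.values())
loss_window = shutdown - restore_point

print(f"Consistent restore point: {restore_point}")
print(f"Data loss window: {loss_window}")
```

Notice that the database's near-real-time recovery capability buys you nothing here: the weekly app server image drags the whole stack back five days.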

When to Abandon Point-in-Time Recovery

Point-in-time recovery sounds great in theory. Pick a moment before the attack. Restore to that moment. Resume operations.

In practice, PITR only works when:

  • All components support it (they don't)
  • All components can recover to the same timestamp (they can't)
  • Your transaction logs are intact (attackers know to target these)
  • You know exactly when the attack started (you often don't)

There comes a moment in every major recovery when you have to make the call: stop chasing perfect recovery and fall back to your last known-good weekly or monthly backup. Yes, you'll lose more data. But you'll actually recover.

The organizations that recover fastest are the ones that make this call early. The ones that struggle are the ones that spend days trying to achieve perfect PITR across systems that were never designed for it.
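That call doesn't have to be a gut feeling; the four conditions above can be framed as an explicit checklist. A sketch of the decision logic, with the component flags as hypothetical inputs:

```python
# Decide between PITR and falling back to a periodic full backup.
# Component names and flags are hypothetical example inputs.
components = [
    {"name": "database",   "supports_pitr": True,  "logs_intact": True},
    {"name": "file store", "supports_pitr": False, "logs_intact": True},
    {"name": "app tier",   "supports_pitr": False, "logs_intact": False},
]

attack_start_known = False  # often you simply don't know

def restore_strategy(components, attack_start_known):
    """PITR is only viable if every condition holds for every component."""
    pitr_possible = (attack_start_known
                     and all(c["supports_pitr"] and c["logs_intact"]
                             for c in components))
    return "point-in-time recovery" if pitr_possible else "last known-good periodic backup"

print(restore_strategy(components, attack_start_known))
```

The value of writing it down this way is that a single False anywhere ends the debate, instead of days of chasing a recovery that was never achievable.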

The Documentation You Wish You Had

In the middle of a crisis, you will desperately need documentation that doesn't exist. Here's the minimum you need for legacy app restoration:

As-Built Documentation: Not the architecture diagram from five years ago. The actual current state. What servers. What connections. What ports. What credentials. What dependencies. What order things need to start in.

Offline copies: If your documentation is on a SharePoint that's also encrypted, you don't have documentation. Critical system docs need to exist offline, updated regularly, accessible when everything else is down.

Recovery runbooks: Step-by-step procedures for restoring each application. Tested procedures. If your runbook says "restore from backup" without specifying which backup, where it is, how to access it, and what order to restore components, it's useless.

Dependency maps: This app talks to that app which depends on this service which requires that database. You need to know the full chain. And it needs to be current.
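A dependency map doesn't have to be a diagram. A machine-readable edge list lets you compute a valid startup order directly, using nothing but Python's standard library. A sketch, with hypothetical app and service names:

```python
from graphlib import TopologicalSorter

# Each app maps to the set of things it depends on.
# App and service names are hypothetical.
depends_on = {
    "billing app":  {"billing db", "auth service"},
    "auth service": {"auth db"},
    "reporting":    {"billing app"},
    "billing db":   set(),
    "auth db":      set(),
}

# static_order() yields dependencies before dependents,
# i.e. the order in which to bring components back up.
startup_order = list(TopologicalSorter(depends_on).static_order())
print(startup_order)
```

As a bonus, TopologicalSorter raises an error on circular dependencies, which is exactly the kind of surprise you want to discover in a drill, not during a recovery.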

If you don't have this documentation, you'll be reverse-engineering it during the crisis. That's possible. We did it. But it costs you days you don't have.

Reverse Engineering What You Don't Have

For legacy applications without documentation, you're going to reverse-engineer. Here's the approach that worked for us:

Start with network captures. Before the attack, what was talking to what? Firewall logs, network flow data, packet captures if you have them. This tells you the actual dependencies, not the documented ones.
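Reducing flow data to a dependency edge list is mechanical: count how often each source talks to each destination, then sort by frequency. High counts suggest real runtime dependencies; one-off connections sink to the bottom. A minimal sketch over hypothetical flow records:

```python
from collections import Counter

# Hypothetical flow records: (source host, destination host, destination port).
flows = [
    ("app01",   "db01", 1521),
    ("app01",   "db01", 1521),
    ("app01",   "mq01", 5672),
    ("batch01", "app01", 8080),
    ("app01",   "db01", 1521),
]

# Tally observed (src, dst, port) edges.
edges = Counter((src, dst, port) for src, dst, port in flows)

for (src, dst, port), count in edges.most_common():
    print(f"{src} -> {dst}:{port}  ({count} flows)")
```

The output of a pass like this is the raw material for the dependency map described above: the actual dependencies, not the documented ones.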

Interview the operators. The people who actually run these systems know things that aren't written down anywhere. Pull them in. Ask them to walk through a normal day. What do they check? What do they restart when it breaks? What's the tribal knowledge?

Trace the data flows. Follow a transaction from start to finish. What systems does it touch? In what order? What happens if one component is missing?

Test in isolation. Bring up components one at a time. See what fails. The error messages will tell you what's missing, if you can read them. Document as you go. You're building the runbook you should have had.

This is slow, painful work. But it's the only way when the documentation doesn't exist.

The Legacy Platform Problem

Every large enterprise has them. Legacy platforms running critical business processes. Systems so old that the original vendors no longer exist. Licensing models that don't account for disaster recovery. Knowledge that walked out the door when the last expert retired.

In our case, we had PICK applications that were critical to operations. If you've never encountered PICK, it's a database environment dating back to the 1960s. Arcane doesn't begin to describe it.

These legacy systems share common traits:

  • They're unsupported but critical to operations
  • Their licensing or activation is tied to specific hardware
  • You can restore the data, but you can't restore the ability to run
  • The people who understood them are gone

When you can't quickly reverse-engineer these systems in the middle of a crisis, you're looking at days of downtime. When revenue loss is measured in millions per day, that's not an abstract problem.

How does every organization get here? The same way. Leadership prioritizes new development over legacy sunsetting. In the moment, that decision makes sense. New capabilities. Competitive advantage. Revenue growth. The legacy stuff works, so why touch it?

Until the day it doesn't work, and you discover you can't bring it back.

The Exception Cycle

After an attack, the cyber experts tell you what you already know: these legacy systems on unsupported operating systems are a risk. They should be quarantined. They should not come back online until they're remediated or replaced.

Management agrees. Of course they agree. The risk is obvious. The systems will stay offline until they're properly addressed.

Then two weeks pass. The revenue losses mount. Every day, the pressure increases. The business units are screaming. Customers are leaving. The board is asking questions.

And then the exceptions start.

"Just this one system. Just until we can find an alternative. Just temporarily."

Two weeks into a recovery, with revenue hemorrhaging at millions per day, the calculus changes. Systems that were quarantined get exception approvals. The vulnerable boxes come back online because the alternative is the business grinding to a halt.

This is the moment that defines whether an organization actually changes, or just survives until next time.

What I'd Do Differently

If I could go back and prepare for this before it happened, here's what I'd prioritize:

Document everything, offline. As-built documentation for every critical system. Printed if necessary. Stored somewhere that doesn't depend on your network being up.

Test your restores. Not just "does the backup exist" but "can we actually bring this system back from nothing." Do it annually at minimum. Do it for your oldest, ugliest systems first.
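A restore drill can be automated as a simple harness: restore into a scratch environment, run smoke checks against the result, and fail loudly if any check misses. A sketch where restore() and the checks are placeholders standing in for your real tooling (backup CLI, SQL row counts, application pings):

```python
# Hypothetical restore-drill harness. restore() and the checks below are
# placeholders for your actual backup tooling and smoke tests.

def restore(app, target="scratch"):
    """Stand-in for an actual restore into an isolated environment."""
    return {"app": app, "target": target, "rows": 1_204_531, "service_up": True}

def drill(app, min_rows):
    """Restore an app and run smoke checks; return (passed, per-check detail)."""
    result = restore(app)
    checks = {
        "restored to isolation": result["target"] == "scratch",
        "service came up":       result["service_up"],
        "data looks complete":   result["rows"] >= min_rows,
    }
    return all(checks.values()), checks

ok, checks = drill("payroll", min_rows=1_000_000)
print("PASS" if ok else "FAIL", checks)
```

The specific checks matter less than the habit: a drill that runs on a schedule and reports pass/fail is what separates "the backup exists" from "we can actually bring this system back from nothing."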

Know your true dependencies. Not the architecture diagram. The actual runtime dependencies. What talks to what. What fails when something is missing.

Sunset before you have to. Every year you delay replacing that legacy system is a year of accumulated risk. The cost of replacement looks high until you compare it to the cost of not being able to recover.

Accept that PITR is a myth for complex stacks. Plan for falling back to periodic backups. Know how much data loss you can accept. Make that decision before you're in crisis.

The Real Lesson

The technical lessons matter. Backups, documentation, runbooks, dependency maps. All of it matters. But the real lesson is simpler and harder:

The systems you've been ignoring because they "just work" will be the ones that destroy you when everything is burning.

Legacy systems don't announce when they've become unrecoverable. They just sit there, running, year after year, while the knowledge to restore them slowly disappears. The vendor goes out of business. The expert retires. The documentation gets stale. The recovery procedures go untested.

And then one day, you need to restore them, and you discover that you can't.

Don't wait for that day to find out.

- George Milliken, Co-Founder, The SRE Project
