Building an IT incident response plan when email is down

How to design an IT crisis communication chain that still works when the systems you're trying to fix are the outage.

Most IT incident response plans assume email and Slack are working. The on-call rotation lives in a Confluence page; the escalation list is a distribution group; status updates go to an internal Slack channel. Then a Microsoft 365 outage takes mail and Teams down for four hours, and the response team finds out their entire plan was hosted inside the system that broke. The fix isn't a thicker binder — it's an IT incident response plan that assumes its own primary tools can be the outage.

The circular dependency that breaks IT incident response

The pattern repeats across every postmortem. The runbook is in a wiki. The wiki authenticates against the same SSO that's down. The on-call schedule is in a paging tool, but the alert routes through email, which is broken. The incident channel is in Slack, but Slack's status page is yellow. Every link in the response chain depends on the same identity, network, or message bus that triggered the incident in the first place.

A defensible IT incident response plan starts by mapping that dependency graph and breaking at least one link in it on purpose.

Map your dependencies before you map your runbook

Before you write a single response step, list every tool the response itself relies on:

  • Identity provider (SSO, MFA).
  • Notification path (email, Slack, Teams, paging tool).
  • Knowledge base (Confluence, Notion, internal wiki).
  • Status page hosting (internal vs. third-party).
  • VPN or zero-trust gateway.

For each, ask one question: if this is the outage, can the team still respond? The tools that answer "no" are your circular dependencies. They need an out-of-band alternative documented before the next incident, not during it.
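
If it helps to make the audit concrete, here's a minimal sketch in Python. The tool names and dependency labels are hypothetical; the point is that each tool carries an explicit record of what it depends on, so you can ask "what breaks when X is the outage?"

```python
# Hypothetical inventory of the tools the response itself depends on.
# "depends_on" names the shared infrastructure behind each tool; any
# tool whose dependency can itself be the outage is in-band.

RESPONSE_TOOLS = {
    "wiki":       {"depends_on": {"sso"}},
    "paging":     {"depends_on": {"email"}},
    "slack":      {"depends_on": {"slack_cloud"}},
    "sms_bridge": {"depends_on": set()},  # carrier network only
}

def circular_dependencies(outage: str) -> list[str]:
    """Tools that stop working when `outage` is the incident."""
    return [name for name, tool in RESPONSE_TOOLS.items()
            if outage in tool["depends_on"]]

print(circular_dependencies("sso"))    # ['wiki']
print(circular_dependencies("email"))  # ['paging']
```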

Define out-of-band channels for IT crisis communication

"Out-of-band" means the channel doesn't share infrastructure with the system that's down. Practically, that means:

  • SMS to personal mobile numbers, which routes through carriers, not your mail server.
  • Voice calls, for staff who don't read texts quickly.
  • Mobile push notifications from a cloud-hosted alerting app, which lands even if your office network is down.
  • Desktop alerts fired from a service that doesn't authenticate against your domain controller.

These should not all live in the same vendor. If your IT outage communication tool is hosted on the same cloud region as the systems you're trying to monitor, it isn't out-of-band — it's just relabeled.
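
As one concrete illustration, here's a hedged sketch of an SMS page sent through Twilio's Python SDK; any carrier-backed provider works the same way. The environment variable names and phone numbers are placeholders, not prescriptions:

```python
# Minimal out-of-band page over SMS, sketched with Twilio's Python SDK.
# Credentials come from the environment so the script itself can live
# outside SSO and the corporate network.
import os
from twilio.rest import Client

client = Client(os.environ["TWILIO_ACCOUNT_SID"],
                os.environ["TWILIO_AUTH_TOKEN"])

def page_via_sms(to_number: str, incident: str) -> None:
    # Routes over the carrier network: no mail server, no SSO, no VPN.
    client.messages.create(
        to=to_number,
        from_=os.environ["TWILIO_FROM_NUMBER"],  # hypothetical env var
        body=f"[SEV1] {incident} -- ack by replying, not via Slack",
    )

page_via_sms("+15550100", "SSO outage; bridge details in next message")
```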

Product spotlight

A comms channel that's still alive — with the playbook attached

Castatus runs over SMS, voice, mobile push, and desktop alerts independently of your email and Slack stack, and pushes the right runbook to the right person based on role and event type — so the on-call lead opens the alert with the playbook already attached.

See how it works

Build the on-call escalation chain — twice

Every on-call schedule should exist in two places: the system that pages you on a normal day, and a flat, exportable copy that lives off that system. A printed list in the NOC, a PDF on a USB drive, a phone-tree card in the on-call lead's wallet — anything that can be reached when SSO is down. Update it quarterly, and on every personnel change.

The chain itself should escalate by time, not by acknowledgment. If the primary doesn't respond in five minutes, the system pages the secondary automatically. Don't wait for a human to notice the silence.
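
A sketch of that logic, with the paging and acknowledgment calls left as stand-ins for whatever your alerting tool actually exposes:

```python
# Time-based escalation: page the next person after a fixed timeout
# instead of waiting for anyone to notice the silence. send_page() and
# check_acknowledged() are stand-ins for your alerting tool's API.
import time

ESCALATION_CHAIN = ["primary", "secondary", "duty_manager"]
ACK_TIMEOUT_SECONDS = 5 * 60  # five minutes per hop

def escalate(incident_id: str, send_page, check_acknowledged) -> str | None:
    for responder in ESCALATION_CHAIN:
        send_page(responder, incident_id)
        deadline = time.monotonic() + ACK_TIMEOUT_SECONDS
        while time.monotonic() < deadline:
            if check_acknowledged(incident_id):
                return responder  # someone owns it; stop escalating
            time.sleep(10)
    return None  # chain exhausted: trigger the all-hands path
```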

Push the playbook with the page

The runbook problem isn't that it doesn't exist — it's that the on-call engineer has to find it, log into the wiki, navigate to the right page, and remember which version is current. At 2 a.m., during an outage that just took SSO down, that's three failure modes too many. A modern alerting platform can ship the relevant runbook with the page itself, scoped to who's receiving it and what kind of event fired.

Map two dimensions to make this work:

  • Role to runbook. The on-call DBA gets the database failover playbook. The network lead gets the DNS/BGP recovery doc. The comms lead gets customer message templates. The security analyst gets the IR plan.
  • Event type to context. A Sev-1 outage page should arrive with the postmortem template and the incident-channel link. A phishing alert should arrive with the takedown contacts and the user-notification template.

Stored once, attached automatically. The right person opens the alert and the playbook is already there — no SSO challenge, no wiki search, no tab-juggling at 2 a.m.
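
One way to picture that two-dimensional mapping, with illustrative role names and file names:

```python
# A sketch of the mapping: (role, event_type) resolves to the attachment
# set that ships with the page. All names here are illustrative.
ATTACHMENTS = {
    ("dba", "sev1_outage"): ["db-failover-playbook.pdf",
                             "postmortem-template.md"],
    ("network_lead", "sev1_outage"): ["dns-bgp-recovery.pdf"],
    ("comms_lead", "sev1_outage"): ["customer-status-templates.md"],
    ("security_analyst", "phishing"): ["takedown-contacts.md",
                                       "user-notification-template.md"],
}

def build_page(role: str, event_type: str, summary: str) -> dict:
    """Assemble an alert with its runbook already attached."""
    return {
        "summary": summary,
        "attachments": ATTACHMENTS.get((role, event_type), []),
    }

print(build_page("dba", "sev1_outage", "Primary DB unreachable"))
```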

Pre-stage status messages for the three audiences

An IT outage has three audiences, and each one needs different language at a different cadence:

  1. Internal staff need to know what's down, what to do instead, and when the next update is coming. Short, calm, every 15–30 minutes.
  2. Leadership needs scope, business impact, and an ETA. Updated less frequently but with more detail.
  3. Customers and partners need a public-facing, plain-language status update tied to your crisis manager playbook so the timing matches the internal one.

Pre-write three templates for each audience: incident confirmed, ongoing investigation, resolved. Templates remove the cognitive load of drafting calm, accurate prose while a system is on fire.
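
A minimal way to store and fill them, sketched in Python with hypothetical placeholder fields:

```python
# Pre-staged templates keyed by (audience, phase). str.format fills in
# the specifics at incident time; nothing is drafted from scratch.
TEMPLATES = {
    ("internal", "confirmed"):
        "{service} is down. Use {workaround}. Next update at {next_update}.",
    ("leadership", "confirmed"):
        "{service} outage affecting {impact}. Next assessment at {eta}.",
    ("customer", "confirmed"):
        "We're investigating an issue with {service}. Updates at {status_url}.",
    # ...repeat for the "investigating" and "resolved" phases
}

def render(audience: str, phase: str, **facts: str) -> str:
    return TEMPLATES[(audience, phase)].format(**facts)

print(render("internal", "confirmed",
             service="email", workaround="the SMS bridge",
             next_update="14:30 UTC"))
```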

 
Tip. Store templates in two places — your normal knowledge base and a markdown file in a private repo on a personal device. The day SSO breaks, you'll be glad the second copy exists.

Run a tabletop with email turned off

An IT incident response plan you've never exercised against your actual primary channels is theoretical. Run a quarterly tabletop with one rule: assume email, Slack, and SSO are unavailable. Walk through the first 30 minutes of an incident — paging, status messaging, customer comms — using only the out-of-band channels you've set up. The first time you do this, expect surprises. That's the point. The Respond function of NIST's Cybersecurity Framework is a useful structure if you don't want to write your own from scratch.

A response plan that runs on the systems you're paid to keep up isn't a plan — it's wishful thinking.

Common gaps in IT outage communication

  • Personal mobile numbers not collected. HR has them; the on-call system often doesn't.
  • Status page on the same cloud region as the product. When us-east-1 goes down, so does the page that says us-east-1 is down.
  • Customer comms gated by a marketing approval flow. A 45-minute approval is incompatible with a 5-minute incident.
  • No defined "all-clear" trigger. Incidents don't end officially, so they don't end at all.
  • Documentation locked behind SSO. The runbook is unreachable when you need it most.
 
Watch out. A status page hosted on the same provider as your product fails when your product fails. Host it elsewhere, even if "elsewhere" is a static page on a different cloud.

What to do this week

  • Map every tool your incident response depends on. Mark each one in-band or out-of-band.
  • Confirm every on-call engineer has a current personal mobile number registered.
  • Pre-write three status templates per audience (internal, leadership, customer).
  • Move at least one copy of the on-call schedule and runbook outside SSO.
  • Map at least one role-and-event-type pair to a runbook your alerting platform can push automatically.
  • Schedule a 30-minute tabletop with email and Slack assumed dark.

The point of an IT incident response plan isn't to look thorough on paper — it's to function on the worst day of the year. Start by assuming your primary tools are the outage, and design backwards from there. The next incident will tell you whether you got it right.

Ready to see how Castatus handles this?

Get a walkthrough of how the Castatus Cloud platform applies what you just read.

Request a demo