Incident Response · Security Champions

Module 8a · Incident Response

Incident Response

A structured approach to handling security incidents. Part 1 of 2 — the PICERL cycle.

Your journey · Program map

Your journey so far

9 modules. One toolkit. You are at Module 8.

Your toolkit · So far

What you already have

Modules 1–7

M1–2

Champion role, sprint workflow, business language

How you operate day-to-day as a Champion

M3

Risk vocabulary: OWASP Top 10, five lenses

What to look for in every product you touch

M4

Threat finding: STRIDE → DREAD → tickets

How you hunt and file threats

M5–7

Where threats live: supply chain, secrets, config

Controls for code, credentials, and configuration

Module 8 · When prevention isn't enough

When prevention isn't enough

Modules 2–7 built your prevention toolkit. But prevention has limits. At some point — not if, when — something will get through.

How do you respond when the pager goes off at 2 AM? How do you contain without destroying evidence? How do you communicate during chaos?

You'll learn three things:

1. PICERL cycle: Prepare → Identify → Contain → Eradicate → Recover → Learn

2. Your first runbook: a concrete, testable procedure, not a policy document

3. Your IR role as Champion: the product's representative, providing architecture context the IR team can't get elsewhere

Act 1 · Definitions

The foundation

An event is an observable change in a system. The word that matters is observable — if your monitoring doesn't capture it, if nobody notices it, the event doesn't exist for you.

Somewhere right now, a server is being compromised. Without observation, it's invisible.

Your ability to respond to incidents depends entirely on your ability to observe events. Monitoring coverage defines the boundary of what you can protect.

From event to incident

An incident is an event that has caused — or could cause — negative impact to your business.

This definition is broader than "the system is down." Unauthorized data access is an incident even when the system works perfectly. Undetected data modification is an incident. A published vulnerability affecting your stack can be treated as an incident.

Some organizations classify published vulnerabilities as incidents: doing so triggers the response process, creates a ticket, and ensures someone addresses the vulnerability. Whether you include precursors (signals of something that could happen) in your incident definition is a design choice, but some process must capture them.

Time to think · The observation gap

Better monitoring closes the first gap. Better processes close the second.

Already happened vs might happen

Indicators — evidence of something that already occurred. Unauthorized access in logs. Customer data on a dark web marketplace.

Precursors — signals of something that could happen. A critical vulnerability published. An unrotated service account with admin access.

A new critical Kubernetes vulnerability is published. Is it an incident for your team? If yes, it enters the response pipeline and someone patches it. If no, it must be captured by vulnerability management. The worst outcome is when nobody owns it.
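The "nobody owns it" outcome can be ruled out by writing the routing decision down. The sketch below is illustrative only: the `Signal` type, the queue names, and the `precursors_are_incidents` switch are assumptions made for this example, not part of any specific tool.

```python
from dataclasses import dataclass
from enum import Enum


class SignalKind(Enum):
    INDICATOR = "indicator"  # evidence that something already happened
    PRECURSOR = "precursor"  # a sign that something could happen


@dataclass(frozen=True)
class Signal:
    description: str
    kind: SignalKind


def route(signal: Signal, precursors_are_incidents: bool) -> str:
    """Return the (hypothetical) queue that owns this signal."""
    if signal.kind is SignalKind.INDICATOR:
        return "incident-response"
    # Precursors go to whichever process the organization chose,
    # but they always go somewhere: there is no unowned branch.
    if precursors_are_incidents:
        return "incident-response"
    return "vulnerability-management"
```

Whichever way the switch is set, a published Kubernetes vulnerability lands in exactly one queue.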

Time to think · The PICERL cycle

Six phases. The cycle improves with every incident your team processes.

Act 2 · Preparation

Before the alert fires

Preparation is everything you set up before an incident occurs — what counts as an incident, who responds, what tools they use, what criteria they follow.

Five severity levels are a common convention: critical, high, medium, low, informational. One project used a 10-point scale, and people couldn't distinguish a 6 from a 7. Simpler scales mean faster triage.
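A five-level scale is small enough to encode directly. This is a minimal sketch; the numeric values and the paging threshold in `needs_immediate_page` are assumptions chosen for illustration, and your organization's criteria will differ.

```python
from enum import IntEnum


class Severity(IntEnum):
    INFORMATIONAL = 1
    LOW = 2
    MEDIUM = 3
    HIGH = 4
    CRITICAL = 5


def needs_immediate_page(severity: Severity) -> bool:
    """Hypothetical rule: page someone right away for HIGH and above."""
    return severity >= Severity.HIGH
```

With only five ordered values, a rule like this stays a single comparison that a first responder can read at a glance.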

In distributed architectures, predicting impact is hard. If the payment service degrades, how many downstream services break? You often don't know until it happens.

Your role as a Champion during an incident: you provide product context — architecture, trust boundaries, data flows, recent changes. The IR team has the security expertise; you bridge the gap to the product. After the incident, you write or update the runbook for this incident type. You are not the incident commander — you are the product's representative in the response process.

You can't prepare for everything

Some incident types aren't in your playbook because nobody imagined they could happen. A cloud provider taking down entire regions during an infrastructure update. A third-party SDK pushing a compromised version through a legitimate update channel.

Trying to cover every possible scenario from the start is impossible and counterproductive. It produces a 200-page document that nobody reads.

Start with your top 5 most likely incident types. Define procedures for those. After each real incident, add the new type. In one to three years, your playbook covers 95% of what actually happens — because it's built from reality.

★ Best Practice

Build from real incidents, not theory

Building your IR process from real incidents rather than theoretical scenarios

Start with five severity levels and five incident types. Define response procedures for those. Everything else gets the generic playbook until you have data.
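The "five types plus a generic fallback" idea can be sketched as a lookup table. The incident type names and playbook identifiers below are hypothetical placeholders; pick the five that are most likely for your product.

```python
GENERIC_PLAYBOOK = "generic-incident"

# Hypothetical starting set of five incident types.
PLAYBOOKS = {
    "credential-compromise": "pb-credential-compromise",
    "malware": "pb-malware",
    "data-leak": "pb-data-leak",
    "denial-of-service": "pb-denial-of-service",
    "dependency-compromise": "pb-dependency-compromise",
}


def playbook_for(incident_type: str) -> str:
    """Return the specific playbook if one exists, else the generic one."""
    return PLAYBOOKS.get(incident_type, GENERIC_PLAYBOOK)


def add_playbook(incident_type: str, playbook_id: str) -> None:
    """Call after a real incident revealed a type that was not yet covered."""
    PLAYBOOKS[incident_type] = playbook_id
```

An unknown type falls through to the generic playbook until a real incident justifies adding a specific one.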

After each real incident, update the playbook: add the new incident type, adjust criteria, tune thresholds. Review your plans annually — stored configurations, golden images, and backup procedures decay over time. The OS image you saved two years ago may not be available in your cloud provider anymore.

The process is never finished. It's iterated.

Act 3 · Identification

Where signals come from

Incident signals arrive from three categories of sources:

Technical — monitoring alerts, SIEM, IDS/IPS, log analysis. The automated layer.

Human — a Slack message, a support call, a developer noticing something odd. Often the first signal before automated alerts fire.

External — a news article, a dark web monitoring alert, a security researcher's disclosure, a partner reporting anomalous traffic.

People are often the first responders — before any monitoring system fires. Your process needs a clear path for human-reported signals, not just automated ones.

Same event, different causes

A customer calls: "I can't log in." This is an event with at least four possible explanations:

Forgot the password, wrong keyboard layout, caps lock. No incident — just a user having a bad morning.

Auth database full, login service crashed. An operational incident, but not a security one.

Account compromised, password changed by attacker. Security incident — containment required.

Someone with access reset the password inappropriately. A different kind of security incident with different containment.

One event. Four possible causes. The first responder needs written criteria to distinguish between them — not intuition.
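Written criteria for the "I can't log in" case might look like the sketch below. The four facts collected in `LoginFacts` and the order of the checks are assumptions for illustration; a real triage tree would use whatever evidence your auth system actually exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoginFacts:
    auth_service_healthy: bool       # login service up, auth database writable
    password_changed_recently: bool  # password changed shortly before the report
    changed_by_owner: bool           # the account owner made that change
    changed_by_staff: bool           # an internal operator made that change


def triage_login_failure(facts: LoginFacts) -> str:
    """Map one event ("I can't log in") to one of four outcomes."""
    if not facts.auth_service_healthy:
        return "operational incident"
    if facts.password_changed_recently and not facts.changed_by_owner:
        if facts.changed_by_staff:
            return "security incident: internal reset, verify authorization"
        return "security incident: possible account compromise"
    return "no incident: likely user error"
```

Each branch returns a distinct outcome, so the first responder's decision depends on recorded facts rather than intuition.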

Time to think · Initial triage

The first responder follows the tree. Written criteria, not improvisation.

Knowledge check · Triage

Your monitoring shows 847 failed authentication attempts for an admin account in 30 minutes, from 12 different IP addresses. The account is not locked. What type of incident is this?

Answer: D. It is both. The brute-force attack is happening now (indicator). If the account isn't locked and the password is weak, compromise is imminent (precursor). Immediate action: lock the account or enforce rate limiting.
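A detection rule for this scenario could be sketched as follows. The tuple layout of the failure records and the `max_failures` threshold are assumptions for this example; real log formats and thresholds depend on your monitoring stack.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def brute_force_suspects(failures, window=timedelta(minutes=30),
                         max_failures=100, now=None):
    """Flag accounts with too many failed logins inside the window.

    failures: iterable of (timestamp, account, source_ip) tuples.
    Returns {account: (failure_count, distinct_ip_count)}.
    """
    now = now or datetime.now()
    counts = defaultdict(int)
    ips = defaultdict(set)
    for ts, account, ip in failures:
        if now - window <= ts <= now:
            counts[account] += 1
            ips[account].add(ip)
    return {a: (n, len(ips[a])) for a, n in counts.items() if n > max_failures}
```

Reporting the number of distinct source addresses alongside the count helps separate a distributed attack from one misconfigured client retrying in a loop.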

Act 4 · Containment

The most important phase

Containment is the most counterintuitive phase. The instinct is to investigate — who did this? How did they get in? But investigation takes time, and the incident is ongoing during every minute of analysis.

The principle: containment takes priority over investigation. Predefined actions, executed immediately based on incident type, without waiting for a complete understanding of what happened.

If your containment procedure requires investigation before action, it's not a containment procedure — it's an investigation procedure with a containment label.
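One way to test whether a procedure is really containment is to check that it can be selected from the incident type alone. The mapping below is a hypothetical sketch; the type names and actions are placeholders, not a recommended set.

```python
# Hypothetical mapping from incident type to predefined first actions.
CONTAINMENT_ACTIONS = {
    "credential-compromise": ["lock account", "revoke active sessions"],
    "compromised-host": ["snapshot disk", "isolate host from network"],
}

DEFAULT_ACTIONS = ["escalate to incident commander"]


def containment_plan(incident_type: str) -> list[str]:
    """Look up actions by type only; no root-cause input is required."""
    return list(CONTAINMENT_ACTIONS.get(incident_type, DEFAULT_ACTIONS))
```

The function takes no argument describing who the attacker is or how they got in, which is the point: those answers come later.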

Short-term, backup, long-term

Short-term — immediate actions to stop ongoing damage. Block the account, restrict access to trusted IPs, isolate the host. This buys time, not a permanent fix.

Backup — before changing anything, snapshot the current state. Disk images, database backups, config snapshots. Evidence you don't preserve now is evidence you can never analyze.

Long-term — fixes that allow systems to operate securely while investigation continues. Patching, account removal, network segmentation.

The best malware analysis source is a RAM dump. But shutting down a server destroys RAM. For most organizations, restoring business takes priority over forensic completeness. If evidence preservation is required, build memory dumps into the containment plan — before shutdown, not instead of it.
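The "before shutdown, not instead of it" rule is an ordering constraint, and it can be sketched as one. The `Step` type and the step names are assumptions for illustration; the only technique shown is a stable sort that moves evidence-preserving steps to the front.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    preserves_evidence: bool = False


def order_steps(steps: list[Step]) -> list[Step]:
    """Stable sort: evidence-preserving steps first, original order otherwise."""
    return sorted(steps, key=lambda s: not s.preserves_evidence)


plan = order_steps([
    Step("shut down host"),
    Step("dump memory", preserves_evidence=True),
    Step("patch service"),
    Step("snapshot disk", preserves_evidence=True),
])
```

Here `plan` starts with the memory dump and the disk snapshot, followed by the shutdown and the patch in their original relative order.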

Act 4 · Real case

Block everyone. Use the paper.

A private bank with fewer than 1,000 high-value clients had a radical containment approach for critical AD compromises.

🚫 Slack: compromised

📱 Signal: out-of-band

📞 Phone tree: pre-registered

Pattern

When the network is compromised, out-of-band communication saves you. Pre-register a backup channel (phone tree, Signal group) before you need it.

Time to think · Predefined containment

No investigation needed. No decision-making under pressure. The response was defined before the incident happened.

Champion's takeaway

If attackers are reading your Slack, your incident response is their intelligence feed. Assume breach in your communication planning.

Act 4 · Real case

When attackers own the conversation

A large holding company's Active Directory was fully compromised — every user, every mailbox, every internal conversation was controlled by the attackers.

Time to think · The correct sequence

Investigation is important — but it comes after the bleeding stops, not before.

Act 5 · Eradication

Remove, restore, verify

Eradication removes the threat and restores affected systems to a clean state. Two challenges are common:

You can't always restore to the same state — VM images get deprecated, OS versions reach end of life, cloud providers retire services. Your two-year-old backup might reference infrastructure that no longer exists.

And you need to check whether the issue exists in similar systems. An attacker who compromised one service probably probed others.

Document everything during eradication. What was removed, what was restored, what was rebuilt. If you don't document it now, critical details will be forgotten within days.
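A lightweight way to capture these details as you go is an append-only log. The sketch below assumes three action names taken from the sentence above ("removed", "restored", "rebuilt"); the `LogEntry` fields are hypothetical and would normally live in your ticketing system rather than in memory.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

ALLOWED_ACTIONS = {"removed", "restored", "rebuilt"}


@dataclass(frozen=True)
class LogEntry:
    action: str
    target: str
    author: str
    # Timestamp is set when the entry is created, in UTC.
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def record(log: list[LogEntry], action: str, target: str, author: str) -> LogEntry:
    """Append one eradication action to the log and return the new entry."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action}")
    entry = LogEntry(action, target, author)
    log.append(entry)
    return entry
```

Entries are frozen, so a recorded action cannot be silently edited afterwards.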

Act 5 · Recovery

Four numbers that define recovery

RPO — Recovery Point Objective. How much data loss is acceptable? Drives your backup frequency.

RTO — Recovery Time Objective. How long until systems are operational? Drives your infrastructure decisions.

SDO — Service Delivery Objective. The minimum service level that keeps the business alive.

MTO — Maximum Tolerable Outage. The hard ceiling. Beyond this, consequences become catastrophic.

A sports betting platform goes down. Full recovery takes days. But the SDO might be: a static web page saying "call this number" plus phone operators accepting bets. The business continues at reduced capacity while full recovery proceeds. Define your SDO before you need it.
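The four numbers constrain each other, and a small check can catch a plan that does not add up. This is a sketch under stated assumptions: the SDO is a service level rather than a duration, so the hypothetical field `sdo_reached_in` stands for the time needed to reach that minimum level, and all example values would be your own.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rpo: timedelta             # maximum acceptable data loss
    rto: timedelta             # target time to restore systems
    sdo_reached_in: timedelta  # time to reach the minimum service level
    mto: timedelta             # hard ceiling on total outage

    def is_consistent(self) -> bool:
        """Degraded service first, full restore later, both inside the MTO."""
        return self.sdo_reached_in <= self.rto <= self.mto

    def backup_meets_rpo(self, backup_age: timedelta) -> bool:
        """A backup older than the RPO loses more data than is acceptable."""
        return backup_age <= self.rpo
```

If `is_consistent()` returns False, either the RTO is unrealistic or the degraded mode takes too long to stand up, and both are worth knowing before an incident.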

Time to think · Recovery metrics

SDO gets the business running. RTO gets systems back. RPO caps how much data you lose. MTO is the ceiling you must beat.

Knowledge check · Recovery

Your e-commerce platform suffers a ransomware attack. Database backups are 6 hours old. Your MTO is 8 hours. What's the recovery approach?

Answer: D. First, reach the SDO: a static page with a phone number keeps the business alive within minutes. In parallel, restore from the 6-hour backup. You'll lose 6 hours of data, but business continues. Never negotiate with attackers without law enforcement guidance.

Act 6 · Lessons learned

The phase that closes the loop

The lessons learned meeting should happen within two weeks — while details are fresh. It answers: what was the scope, how effective was containment, what worked well, what was slow, and what changes would prevent recurrence?

The most important outcome isn't the report — it's the process changes that result. An updated playbook. A tuned monitoring rule. A new containment procedure. If lessons learned produces only a document, it failed.

Not every incident type can be prevented. When it can't, the goal shifts to reducing response time — detect earlier, contain faster, recover more efficiently. Improving from 4-hour response to 45-minute response is a meaningful security improvement even if the incident itself can't be prevented.

★ Best Practice

One channel, one ticket system

Routing all incident communication through a single channel and ticketing system

During an incident, fragmented communication is the most common process failure. Updates scattered across Slack channels, email threads, and war rooms that not everyone knows about.

Route all incident reports — from employees, customers, and monitoring systems — into a single ticketing system. This creates faster identification (all signals in one place), a searchable knowledge base over time, and a foundation for automation.

Over months and years, the documented procedures that accumulate in this system handle the vast majority of incidents — turning ad-hoc responses into repeatable processes.

Channel template: #inc-YYYY-MM-description. Pin: incident summary, severity, owner, current status. Update every 30 minutes. Tag every decision with timestamp and author. This format is directly usable — create it as a Slack template today.
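The channel template can be generated rather than typed by hand, which keeps names consistent. This sketch only builds strings; it does not call any chat API, and the function names and the slug rule (lowercase, non-alphanumerics collapsed to hyphens) are assumptions.

```python
import re
from datetime import date


def incident_channel(day: date, description: str) -> str:
    """Build a channel name following the #inc-YYYY-MM-description template."""
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"#inc-{day:%Y-%m}-{slug}"


def pinned_summary(summary: str, severity: str, owner: str, status: str) -> str:
    """Format the four pinned fields, one per line."""
    return "\n".join([
        f"Summary: {summary}",
        f"Severity: {severity}",
        f"Owner: {owner}",
        f"Status: {status}",
    ])
```

For example, `incident_channel(date(2024, 3, 5), "Admin brute force!")` yields `#inc-2024-03-admin-brute-force`.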

Summary · Part 1

What you covered

An event is an observable change. An incident is an event that has caused, or could cause, negative business impact. Your monitoring coverage defines what you can protect.

The PICERL cycle provides structure: Prepare → Identify → Contain → Eradicate → Recover → Learn. Each real incident improves the next response.

Containment takes priority over investigation. Predefined actions execute immediately based on incident type.

Recovery has four metrics: RPO, RTO, SDO, and MTO. The SDO keeps the business alive while full recovery continues.

Lessons learned must produce process changes, not just documentation.

If your team interacts with external parties during incidents (regulators, partners, CERTs), two standards matter. NIST SP 800-61, the Computer Security Incident Handling Guide, is a widely used incident handling framework that many regulatory bodies expect. FIRST TLP (Traffic Light Protocol) classifies information by sharing scope: RED (named recipients only), AMBER (limited sharing), GREEN (community), CLEAR (public). Knowing that these exist is enough for now; your security team will guide the specifics.
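The four TLP labels listed above map naturally onto an enumeration. This is a simplified sketch that uses only the sharing scopes named in this module; the `may_publish` helper is hypothetical, and the full protocol has more detail than is modeled here.

```python
from enum import Enum


class TLP(Enum):
    RED = "named recipients only"
    AMBER = "limited sharing"
    GREEN = "community"
    CLEAR = "public"


def may_publish(label: TLP) -> bool:
    """Only information labeled CLEAR may be shared publicly."""
    return label is TLP.CLEAR
```

A check like this could sit in front of any tool that posts incident updates outside the organization.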

Next · Part 2

Part 2 covers alert management, false positive reduction, runbooks and playbooks, triage processes, and automation.
