Module 8b · Operationalizing IR
Security Champions · Module 8b
Operationalizing Incident Response
From theory to daily practice. Alert management, triage, runbooks, and automation. Part 2 of 2.
Cover graphic: 1,247 alerts per day → triage → 23 real.
Act 1 · Alert management
The alert problem
A mature monitoring setup generates hundreds or thousands of alerts per day. Most are noise — expected behavior, known patterns, transient spikes that resolve on their own.
The danger isn't too few alerts. It's too many. When every alert looks the same, the real ones disappear into the flood. This is alert fatigue, and it's the number one reason real incidents get missed.
If your team ignores alerts because there are too many, the monitoring system is working against you. More coverage with no triage produces less security, not more.
Reducing false positives
False positive reduction is an ongoing process, not a one-time configuration. After every incident review, ask: which alerts fired that shouldn't have? Which patterns can be whitelisted?
Before tuning alerts, establish what "normal" looks like. A spike in failed logins at 9 AM Monday is normal — people forgot passwords over the weekend. The same spike at 3 AM Saturday is not. Without a baseline, every anomaly looks like an incident.
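A minimal sketch of an hour-of-week baseline for failed logins follows. The function names, the three-sigma threshold, and the sample counts are illustrative assumptions, not values from any particular SIEM.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, pstdev


def build_baseline(history):
    """Map each (weekday, hour) slot to the mean and spread of past counts."""
    buckets = defaultdict(list)
    for timestamp, count in history:
        buckets[(timestamp.weekday(), timestamp.hour)].append(count)
    return {slot: (mean(counts), pstdev(counts)) for slot, counts in buckets.items()}


def is_anomalous(baseline, timestamp, count, threshold=3.0, min_std=1.0):
    """Flag a count well above what is normal for that hour of the week."""
    slot = (timestamp.weekday(), timestamp.hour)
    if slot not in baseline:
        return True  # no history for this slot, so surface it for review
    avg, std = baseline[slot]
    # min_std keeps very quiet slots from alerting on tiny fluctuations
    return count > avg + threshold * max(std, min_std)


# Illustrative data only: four weeks of failed-login counts for two slots.
monday_9am = datetime(2026, 1, 5, 9)    # a Monday
saturday_3am = datetime(2026, 1, 3, 3)  # a Saturday
history = []
for week, (busy, quiet) in enumerate([(200, 2), (220, 3), (210, 1), (190, 2)]):
    history.append((monday_9am + timedelta(weeks=week), busy))
    history.append((saturday_3am + timedelta(weeks=week), quiet))

baseline = build_baseline(history)
```

With this sample history, 215 failed logins on a Monday at 9 AM stays inside the baseline, while the same count on a Saturday at 3 AM is flagged.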
Track which alert rules generate the most false positives over a month. Usually 20% of your rules generate 80% of the noise. Fix those first. A single well-tuned rule change can eliminate hundreds of daily false positives.
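One way to find the rules worth fixing first is to count closed-as-false-positive alerts per rule and walk down the list until most of the noise is covered. This is a sketch: the rule IDs and counts are made up, and the 80% cut-off is a parameter, not a guarantee about your data.

```python
from collections import Counter


def noisiest_rules(false_positive_rule_ids, noise_share=0.8):
    """Return the fewest rules, noisiest first, that account for
    at least `noise_share` of all false positives."""
    counts = Counter(false_positive_rule_ids)
    total = sum(counts.values())
    selected, covered = [], 0
    for rule_id, n in counts.most_common():
        if total == 0 or covered / total >= noise_share:
            break
        selected.append((rule_id, n))
        covered += n
    return selected


# Hypothetical month of false positives, one entry per closed alert.
closed_as_false_positive = (
    ["geo-login"] * 500
    + ["port-scan"] * 300
    + ["dns-entropy"] * 120
    + ["usb-mount"] * 50
    + ["sudo-fail"] * 30
)
to_tune = noisiest_rules(closed_as_false_positive)
```

In this sample, two of the five rules account for 800 of the 1,000 false positives, so those two are the tuning candidates.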
The goal isn't zero false positives — that would mean missing real incidents. The goal is a manageable signal-to-noise ratio where your team can investigate every alert that fires.
Time to think · The alert funnel
RAW ALERTS · 1,247 per day · SIEM, IDS, WAF, endpoint, cloud
DEDUPLICATED & CORRELATED · ~180 unique events · grouped by source, type, timeframe
TRIAGED · ~45 events · filtered by known-good baselines
ACTIONABLE · 5–10 incidents · require human response

From 1,247 raw alerts to 5–10 that need a human. Each layer is a process you design.
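The first two layers of the funnel can be sketched as follows: group alerts by source, type, and timeframe, then filter out pairs that a known-good baseline covers. The field names, the 10-minute window, and the sample alerts are assumptions for illustration.

```python
from datetime import datetime, timedelta


def correlate(alerts, window=timedelta(minutes=10)):
    """Collapse alerts with the same source and type that arrive within
    `window` of the first alert in their group into one counted event."""
    events = []
    latest = {}  # (source, type) -> most recent event for that pair
    for alert in sorted(alerts, key=lambda a: a["time"]):
        key = (alert["source"], alert["type"])
        event = latest.get(key)
        if event is None or alert["time"] - event["first_seen"] > window:
            event = {"source": alert["source"], "type": alert["type"],
                     "first_seen": alert["time"], "count": 0}
            events.append(event)
            latest[key] = event
        event["count"] += 1
    return events


def drop_known_good(events, known_good):
    """Remove events whose (source, type) pair is on the known-good baseline."""
    return [e for e in events if (e["source"], e["type"]) not in known_good]


# Hypothetical raw alerts.
start = datetime(2026, 5, 4, 3, 0)
raw_alerts = [
    {"source": "waf", "type": "sqli-probe", "time": start},
    {"source": "waf", "type": "sqli-probe", "time": start + timedelta(minutes=2)},
    {"source": "waf", "type": "sqli-probe", "time": start + timedelta(minutes=30)},
    {"source": "backup-server", "type": "bulk-read", "time": start + timedelta(minutes=5)},
]
events = correlate(raw_alerts)
actionable = drop_known_good(events, known_good={("backup-server", "bulk-read")})
```

Four raw alerts become three events, and two remain once the expected backup read is filtered out.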

Best Practice
Treat alert rules like code
Maintaining alert rules with the same discipline as production code
Alert rules should be version-controlled, reviewed, and tested. When you add a new alert rule, document: what it detects, what the expected false positive rate is, and who owns tuning it.

Every alert that fires should have a documented response — even if that response is "verify and close." An alert without a documented response is an alert that trains your team to ignore alerts.

Review alert rules quarterly. Delete rules that haven't fired in 90 days or that produce only false positives. Dead rules create the illusion of coverage.
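If alert rules live in version control, a small check can enforce the documentation requirements above at review time. This sketch assumes a hypothetical AlertRule record. The field names and the lint messages are illustrative, not part of any specific SIEM.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class AlertRule:
    name: str
    detects: str                       # what the rule detects
    expected_fp_rate: Optional[float]  # expected false positive rate, 0.0 to 1.0
    owner: str                         # who owns tuning it
    response: str                      # documented response, even "verify and close"
    last_fired: Optional[date] = None


def lint_rule(rule, today, stale_after=timedelta(days=90)):
    """Return a list of problems; an empty list means the rule passes review."""
    problems = []
    if not rule.detects.strip():
        problems.append("missing description of what the rule detects")
    if rule.expected_fp_rate is None or not 0 <= rule.expected_fp_rate <= 1:
        problems.append("missing or invalid expected false positive rate")
    if not rule.owner.strip():
        problems.append("missing tuning owner")
    if not rule.response.strip():
        problems.append("missing documented response")
    if rule.last_fired is None or today - rule.last_fired > stale_after:
        problems.append("has not fired within the review window")
    return problems


today = date(2026, 5, 4)
good_rule = AlertRule(
    name="brute-force-single-account",
    detects=">100 failed logins in 10 minutes for one account",
    expected_fp_rate=0.05,
    owner="identity-team",
    response="Follow the brute force runbook",
    last_fired=today - timedelta(days=10),
)
undocumented_rule = AlertRule(
    name="legacy-rule", detects="Unknown legacy pattern",
    expected_fp_rate=None, owner="", response="",
)
```

Running lint_rule in a CI job or during the quarterly review gives the same answer no matter who does the review.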
Act 2 · Triage
First response: triage
Triage is the first-contact process: an alert fires, someone looks at it, and within minutes they decide if it's a real incident or noise.
The triage operator needs three things: context (what system, what behavior), criteria (written rules for what constitutes an incident), and authority (permission to escalate without asking for approval).
Without written criteria, triage depends on individual judgment. Individual judgment varies by experience, by shift, by how tired someone is at 3 AM. Written criteria make triage consistent regardless of who's on rotation.
The SLA clock starts when the alert fires, not when someone notices it. If your triage process adds 45 minutes before anyone looks at the alert, your effective response time includes those 45 minutes.
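A short sketch of the arithmetic: measure from the moment the alert fired, not from the moment someone opened it. The timestamps and the 30-minute SLA are assumed values for illustration.

```python
from datetime import datetime, timedelta


def response_times(fired_at, first_viewed_at, first_action_at):
    """Split the response into the wait before triage and the work after it."""
    return {
        "triage_delay": first_viewed_at - fired_at,
        "hands_on_time": first_action_at - first_viewed_at,
        "effective_response": first_action_at - fired_at,
    }


def sla_breached(fired_at, first_action_at, sla):
    """The SLA is measured from the alert firing."""
    return first_action_at - fired_at > sla


fired = datetime(2026, 5, 4, 3, 0)
times = response_times(
    fired_at=fired,
    first_viewed_at=fired + timedelta(minutes=45),
    first_action_at=fired + timedelta(minutes=55),
)
breached = sla_breached(fired, fired + timedelta(minutes=55), sla=timedelta(minutes=30))
```

Here the operator acted 10 minutes after opening the alert, but the effective response time is 55 minutes, which breaches the assumed 30-minute SLA.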
On-call rotation
A 24/7 incident response capability requires an on-call rotation. The mechanics matter more than people think.
One-week rotations are standard. Shorter rotations lose context; longer ones cause burnout. The handoff meeting between rotations is where most information gets lost — document the active investigations, not just the resolved ones.
The on-call person needs a clear escalation path: who to call if they can't resolve the issue, who the backup is if the primary escalation contact doesn't respond within 15 minutes, and when management needs to be notified. Write this down and test it quarterly.
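The escalation path can be written down as data so that it is testable. This is a sketch under assumptions: the contact roles are placeholders, and applying the same 15-minute timeout at every level is a simplification.

```python
def contacts_paged(chain, minutes_unacknowledged, timeout_minutes=15):
    """Return everyone who should have been paged by now.

    The first contact is paged immediately; each further contact is added
    after another `timeout_minutes` without an acknowledgement.
    """
    steps = minutes_unacknowledged // timeout_minutes
    return chain[: min(steps + 1, len(chain))]


# Placeholder roles; a real chain would hold names and phone numbers.
escalation_chain = [
    "on-call engineer",
    "primary escalation contact",
    "backup escalation contact",
]
paged_at_20_min = contacts_paged(escalation_chain, minutes_unacknowledged=20)
```

A quarterly test then amounts to paging through the chain and checking that each contact actually responds.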
On-call work is real work. Teams that don't acknowledge this with compensatory time off, reduced sprint commitments, or explicit recognition will see their best people avoid the rotation. The result: your least experienced people handle the most critical moments.
Time to think · Triage workflow
0 MIN · ALERT FIRES · SLA clock starts
≤ SLA · ON-CALL · Reviews within SLA
DECISION · FALSE POSITIVE: close + tune rule · REAL INCIDENT: create ticket
RESPONSE · ESCALATE · Assign to IR team, begin containment

Every alert follows the same path. Written criteria at the decision point keep triage consistent.

Act 3 · Runbooks & Playbooks
Runbooks vs playbooks
These terms are often used interchangeably, but they solve different problems.
A runbook is a step-by-step procedure for a specific, known scenario. "If X happens, do Y, then Z." No judgment required. A junior operator at 3 AM should be able to follow it and produce the correct result.
A playbook is a broader strategy for a category of incidents. "For ransomware incidents, the priorities are: isolate affected systems, preserve evidence, assess scope, then begin recovery." It requires judgment about which specific steps apply to this particular case.
Start with runbooks. For your top 5 incident types, write the exact steps. Playbooks emerge naturally as you accumulate runbooks and notice patterns across them.
Writing runbooks that work at 3 AM
The test for a runbook is simple: can someone who has never seen this incident type before follow it successfully at 3 AM, under pressure, with no one to ask?
If the answer is no, the runbook isn't finished.
Bad: "Verify the affected systems."
Good: "Run kubectl get pods -n production | grep CrashLoopBackOff and note the pod names."
Every step should include the exact command, the expected output, and what to do if the output doesn't match.
Every containment step should include its reversal. If step 3 is "block IP range X at the firewall," the runbook should say how to unblock it when containment ends. Containment actions without documented reversals become permanent configuration drift.
Time to think · Runbook anatomy
RUNBOOK: BRUTE FORCE
TRIGGER · >100 failed logins / 10 min / single account
STEPS
01 Lock account immediately
02 Check if login succeeded before lockout
03 Block source IPs at WAF
ROLLBACK · Unlock account + remove IP blocks
ESCALATE IF · Login succeeded → full account compromise runbook

Trigger, steps, rollback, escalation. Every runbook has the same four sections.
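The four sections map naturally onto a small data structure, which also makes it possible to check that every state-changing step has a documented reversal. The class and field names below are assumptions for illustration. The content is the brute force runbook from the diagram.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str
    rollback: str = ""          # how to undo the step once containment ends
    changes_state: bool = True  # False for read-only checks


@dataclass
class Runbook:
    name: str
    trigger: str
    steps: list
    escalate_if: str

    def missing_rollbacks(self):
        """List state-changing steps that have no documented reversal."""
        return [s.action for s in self.steps if s.changes_state and not s.rollback]


brute_force = Runbook(
    name="Brute force",
    trigger=">100 failed logins / 10 min / single account",
    steps=[
        Step("Lock account immediately", rollback="Unlock account"),
        Step("Check if login succeeded before lockout", changes_state=False),
        Step("Block source IPs at WAF", rollback="Remove IP blocks"),
    ],
    escalate_if="Login succeeded: switch to the full account compromise runbook",
)
```

An empty result from brute_force.missing_rollbacks() means no containment step can turn into permanent configuration drift unnoticed.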

Knowledge check · Alert management
Your SIEM generates 2,000 alerts per day. Your team investigates about 50 and closes the rest without review. A post-mortem reveals a real breach was in the uninvestigated pile for 3 days. What's the most effective first step?
B. Tuning the noisiest rules is the highest-leverage first step. It reduces volume immediately, improves signal-to-noise ratio, and makes the remaining alerts more actionable. Hiring more people (A) doesn't fix the underlying noise problem. Reducing coverage (C) creates blind spots. AI correlation (D) can help later, but without clean rules it automates bad decisions.
Act 4 · Communication
Where information flows during an incident
During an incident, information fragmentation is the most common process failure. Updates in Slack, decisions in email, status calls that not everyone attends, a shared doc that three people update simultaneously.
Define one primary channel before the incident happens. All updates, decisions, and status changes go there. Secondary channels can exist for deep technical discussion, but the primary channel is the source of truth.
Create a new Slack/Teams channel for each major incident: #inc-2026-05-ransomware. Pin the incident summary at the top. All participants join the channel. When the incident closes, the channel becomes an archive — a complete timeline of what happened, who decided what, and when.
If your company's Slack or email is the compromised system, you need a pre-arranged fallback: a Signal group, personal phone numbers for key people, or a physical meeting point. Document this in a printed card that team members keep at home.
Time to think · Communication hierarchy
PRIMARY CHANNEL · #inc-2026-05-ransomware
SECONDARY CHANNELS (decisions flow back to primary)
TECH DISCUSSION · Deep debugging
MGMT UPDATES · Business impact
EXTERNAL COMMS · Customers, regulators
FALLBACK CHANNEL · Signal group / phone tree / in-person

One source of truth. Secondary channels feed into it. Fallback defined before you need it.

Act 5 · Automation
What to automate, what to keep human
Automation in incident response follows a clear hierarchy: automate the detection, automate the enrichment, semi-automate the containment, and keep the decision human.
Detection automation is your monitoring and alerting, which most organizations have already automated. Enrichment automation gathers context: who owns the affected system, what changed recently, and whether the source IP is known-bad.
Containment automation is where it gets nuanced. Automatic account lockout on brute force? Usually safe to automate. Automatic network isolation of a production server? That needs human approval — the blast radius is too high.
The rule: automate actions that are safe to get wrong. Locking an account incorrectly is a 5-minute fix. Isolating the wrong production server is a revenue-impacting outage.
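The "safe to get wrong" rule can be expressed as a small policy function. This is a sketch only: the inputs (whether the action changes state, how long an undo takes, whether a mistake hits revenue) and the 10-minute undo limit are assumptions chosen to match the examples above.

```python
from enum import Enum


class Mode(Enum):
    AUTOMATE = "automate"
    HUMAN_APPROVAL = "human approval"


def automation_mode(changes_state, minutes_to_undo=0,
                    revenue_impacting=False, max_undo_minutes=10):
    """Automate only actions that are cheap to reverse when they are wrong."""
    if not changes_state:
        return Mode.AUTOMATE  # enrichment and lookups cannot break anything
    if revenue_impacting or minutes_to_undo > max_undo_minutes:
        return Mode.HUMAN_APPROVAL
    return Mode.AUTOMATE


account_lockout = automation_mode(changes_state=True, minutes_to_undo=5)
server_isolation = automation_mode(changes_state=True, minutes_to_undo=5,
                                   revenue_impacting=True)
ip_reputation_lookup = automation_mode(changes_state=False)
```

Account lockout and the reputation lookup come out as automatable, and isolating a production server requires human approval.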
SOAR: connecting the tools
SOAR platforms connect your monitoring, ticketing, and response tools into automated workflows. An alert fires → a ticket is created → threat intelligence is queried → enrichment data is attached → the on-call person receives a notification with full context.
Without SOAR, the on-call person does all of this manually: open the SIEM, check the IP reputation, look up the asset owner, create a ticket, copy the details. This takes 15–20 minutes per alert. With SOAR, it takes seconds.
You don't need a full SOAR platform to start automating. A Python script that queries your SIEM API and creates a Jira ticket with enrichment data is a perfectly valid first step. Build the workflow manually first, then automate the steps that are repetitive and low-risk.
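A first automation step might look like the sketch below. It does not call any real SIEM or Jira API. The three lookup and ticket functions are passed in as parameters and stubbed here, so you would replace them with calls to your own tools. All names, fields, and sample values are hypothetical.

```python
def enrich_and_ticket(alert, lookup_ip_reputation, lookup_asset_owner, create_ticket):
    """Gather context for an alert and file a ticket that already contains it."""
    context = {
        "ip_reputation": lookup_ip_reputation(alert["source_ip"]),
        "asset_owner": lookup_asset_owner(alert["host"]),
    }
    ticket = {
        "summary": f"[{alert['rule']}] {alert['host']}",
        "alert": alert,
        "context": context,
    }
    return create_ticket(ticket)


# Stubs standing in for real integrations (threat intel, CMDB, ticketing).
filed_tickets = []


def fake_create_ticket(ticket):
    filed_tickets.append(ticket)
    return f"IR-{len(filed_tickets)}"


ticket_id = enrich_and_ticket(
    alert={"rule": "brute-force", "host": "auth-01", "source_ip": "203.0.113.7"},
    lookup_ip_reputation=lambda ip: "known-bad",
    lookup_asset_owner=lambda host: "identity-team",
    create_ticket=fake_create_ticket,
)
```

Passing the integrations in as functions keeps the workflow testable without network access, and lets you swap a manual step for an automated one as each proves low-risk.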
Time to think · Automation layers
DETECTION · SIEM rules, IDS signatures, anomaly detection · AUTOMATED
ENRICHMENT · IP reputation, asset owner, recent changes, threat intel · AUTOMATED
CONTAINMENT · Account lockout, IP blocking, network isolation · SEMI-AUTO
DECISION · Severity classification, scope assessment, business impact · HUMAN
SAFE TO GET WRONG → AUTOMATE · HIGH BLAST RADIUS → HUMAN

Automate what's safe to get wrong. Keep humans on decisions with high blast radius.

Pattern
When automation fails, the runbook is your fallback. A runbook that hasn't been tested is a runbook that doesn't work. Test quarterly.
Act 5 · Real case
When the automation fails
A fintech company had a well-designed SOAR workflow: brute force detection → automatic account lockout → ticket creation → notification.
Best Practice
Test the process, not the tools
Running tabletop exercises quarterly to test your IR process
A tabletop exercise is a simulated incident where the team walks through their response process without touching real systems. The facilitator presents a scenario, the team responds as they would in a real incident, and gaps become visible.

Common discoveries: the escalation contact list is outdated, the runbook references a tool the team no longer uses, nobody knows the process for external communication, the backup restoration was never tested.

Run one per quarter. 60–90 minutes. No preparation required from participants. The facilitator prepares the scenario. The team brings their actual process.
Best Practice
Measuring what matters
Tracking four IR metrics that actually drive improvement
MTTD — Mean Time to Detect. How long from when the incident started to when you knew about it. Drives investment in monitoring.

MTTR — Mean Time to Respond. How long from detection to first containment action. Drives investment in triage process and runbooks.

False Positive Rate. What percentage of triaged alerts turn out to be non-incidents. Drives alert tuning priorities.

Recurrence Rate. What percentage of incidents are repeat types. Drives lessons-learned effectiveness. If the same incident type recurs, your post-incident improvements aren't working.
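The four metrics can be computed from incident records as sketched below. The record fields and sample incidents are assumptions, and counting every incident after the first of its type as a repeat is one possible reading of "repeat types".

```python
from collections import Counter
from datetime import datetime


def _mean_minutes(durations):
    return sum(d.total_seconds() for d in durations) / len(durations) / 60


def ir_metrics(incidents, triaged_alerts, false_positives):
    """Compute MTTD, MTTR, false positive rate, and recurrence rate."""
    type_counts = Counter(i["type"] for i in incidents)
    repeats = sum(n - 1 for n in type_counts.values())  # all but the first of each type
    return {
        "mttd_minutes": _mean_minutes([i["detected"] - i["started"] for i in incidents]),
        "mttr_minutes": _mean_minutes([i["contained"] - i["detected"] for i in incidents]),
        "false_positive_rate": false_positives / triaged_alerts,
        "recurrence_rate": repeats / len(incidents),
    }


# Hypothetical incident log.
incidents = [
    {"type": "phishing", "started": datetime(2026, 5, 1, 0, 0),
     "detected": datetime(2026, 5, 1, 2, 0), "contained": datetime(2026, 5, 1, 2, 30)},
    {"type": "phishing", "started": datetime(2026, 5, 2, 10, 0),
     "detected": datetime(2026, 5, 2, 11, 0), "contained": datetime(2026, 5, 2, 11, 10)},
    {"type": "brute-force", "started": datetime(2026, 5, 3, 12, 0),
     "detected": datetime(2026, 5, 3, 12, 30), "contained": datetime(2026, 5, 3, 12, 50)},
    {"type": "malware", "started": datetime(2026, 5, 4, 8, 0),
     "detected": datetime(2026, 5, 4, 8, 30), "contained": datetime(2026, 5, 4, 9, 30)},
]
metrics = ir_metrics(incidents, triaged_alerts=50, false_positives=36)
```

Tracking these per quarter shows whether the investments in monitoring, triage, tuning, and lessons learned are moving the numbers.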
Time to think · IR metrics
MTTD · MEAN TIME TO DETECT · 4.2h → target: <1h · Invest in: monitoring coverage
MTTR · MEAN TIME TO RESPOND · 38m → target: <15m · Invest in: triage + runbooks
FALSE POSITIVE RATE · 72% → target: <30% · Invest in: alert tuning
RECURRENCE RATE · 18% → target: <5% · Invest in: lessons learned

Each metric points to a specific investment. Measure what drives improvement, not what's easy to count.

Knowledge check · Automation
Your team is deciding which IR actions to automate first. Which of the following is the safest candidate for full automation?
C — Enrichment. Enrichment adds information to an alert without changing any system state. If the enrichment data is wrong, no damage occurs — someone just gets inaccurate context. Database isolation (A) and cluster shutdown (D) have high blast radius. Executive notification (B) is safe but low value — the real win is automating the data gathering, not the notification.
Summary · Part 2
What you covered
Alert fatigue is the number one reason real incidents get missed. Tune the noisiest 20% of rules first to reduce 80% of the noise.
Triage requires written criteria, defined SLAs, and clear escalation paths. The on-call person should never have to improvise the process.
Runbooks are step-by-step procedures for known scenarios. They include trigger, steps, rollback, and escalation. Playbooks are broader strategies that emerge from accumulated runbooks.
Automate what's safe to get wrong. Detection and enrichment are fully automatable. Containment is semi-auto. Decisions stay human.
Track MTTD, MTTR, false positive rate, and recurrence rate. Each metric points to a specific investment area.
Module 8 · Complete
Both parts done.
You now have the full incident response picture: the PICERL cycle from Part 1, and the operational processes — alert management, triage, runbooks, automation, and metrics — from Part 2. The next time an alert fires, you'll know not just what to do, but why the process is designed that way.