Building a Windows Chaos Engineering Playbook: Process Roulette for Reliability Testing
A practical playbook for safe Windows 'process roulette' — controlled kills, ProcDump dumps, and measurable resilience testing for 2026.
Start here: why your Windows servers need a process-roulette playbook
You’re responsible for uptime, but every complex Windows deployment is a few unexpected process terminations away from a support queue full of tickets. Traditional chaos engineering focuses on network partitions and CPU pressure; process roulette — controlled, randomized process termination — targets the real, common failure mode: a single process crash that cascades into downtime. This article gives a practical, safe, and repeatable playbook for running process-roulette experiments on Windows servers in 2026, with tooling, dump collection, monitoring, and metrics that let you prove and improve resilience.
The 2026 context: what’s different and why now
In late 2025 and into 2026, two trends made Windows process-level chaos far more relevant. First, enterprise Windows stacks are increasingly hybrid — Windows servers hosting modern .NET Core / .NET 8+ microservices and legacy Win32 services — meaning a single process failure can break multiple layers. Second, observability on Windows has matured: native OpenTelemetry SDKs, richer Event Tracing for Windows (ETW) integrations, and better exporters to Prometheus and cloud APMs make capturing impact fast and reliable. That combination makes process roulette both more actionable and safer than it used to be.
High-level playbook: phases and goals
Treat each experiment like a small incident. Use a repeatable lifecycle:
- Plan — define hypothesis, blast radius, and safety controls.
- Prepare — whitelist/blacklist processes, enable dumps, set observability hooks.
- Execute — run controlled kills (staged, canary, then broader).
- Observe — collect metrics, logs, and dumps in real time.
- Analyze — triage dumps, traces, and SLO effects.
- Remediate & Automate — implement fixes and automation for detection/recovery.
Phase 1 — Planning: hypothesis, blast radius, and safety
Every experiment needs a crisp hypothesis. Example: “If the front-end process P crashes, our API fails gracefully and calls are retried within 30s, maintaining 99% success for checkout.”
- Define success criteria (SLOs, latency, error rate, business transactions).
- Define blast radius: single canary VM, half a pool, or entire cluster.
- Safety gates: human-in-the-loop abort, automatic abort on SLO violation, and a hard blacklist of system-critical PIDs/names.
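It can help to capture the plan as a small, machine-readable artifact that the runner and reviewers both consume. A minimal sketch in PowerShell; the field names and thresholds are illustrative, not a standard schema:
# plan.ps1: illustrative experiment plan (field names and values are placeholders)
$plan = @{
    ExperimentId  = 'checkout-frontend-kill-001'
    Hypothesis    = 'If front-end process P crashes, the API retries within 30s and checkout success stays >= 99%'
    BlastRadius   = @{ Stage = 'canary'; Hosts = @('canary-vm-01') }
    AbortCriteria = @{ MaxErrorRatePercent = 1; MaxP99LatencyMs = 1500 }
    NeverKill     = @('csrss','wininit','services','lsass','smss')
}
# Persist alongside the runbook so the runner and reviewers share one source of truth
$plan | ConvertTo-Json -Depth 5 | Set-Content -Path '.\plan.json'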
Critical Windows processes to never terminate
A whitelist of approved targets is easier to keep safe than a blacklist. If you must blacklist, make it comprehensive. Never target these core processes:
- csrss.exe, wininit.exe, services.exe, lsass.exe, smss.exe, and any svchost.exe hosting system services (unless you know exactly which service group it hosts)
- Hypervisor and host-agent processes (e.g., Hyper-V worker and management services), backup agents, and endpoint security products, unless they are explicitly in scope
Phase 2 — Prepare: tooling, dump collection, and instrumentation
Preparation reduces risk and maximizes learning. Equip each test host with three capabilities: controlled-kill tooling, reliable dump collection, and observability plumbing.
Choose safe tooling for controlled kills
Options:
- PowerShell (Stop-Process -Id or -Name) — fine for scripted, targeted kills.
- taskkill — built-in, can send WM_CLOSE or force ( /PID, /T, /F ).
- Third-party agents — Gremlin (commercial) offers Windows agents with safe experiment orchestration and built-in blast-radius controls; evaluate it for your environment.
- Custom agent — lightweight PowerShell/Go agent that accepts experiment plans and respects whitelist/blacklist.
Always implement a dry-run mode that logs targeted PIDs and resolves names to ensure you don’t accidentally target grouped system services.
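Before any destructive run, verify exactly what a name pattern resolves to. A minimal dry-run sketch (the pattern is a placeholder):
# Dry run: resolve a name pattern to concrete PIDs and paths without killing anything
$pattern = 'MyApp*'   # placeholder pattern
Get-Process |
    Where-Object { $_.ProcessName -like $pattern } |
    Select-Object Id, ProcessName, Path |
    Format-Table -AutoSize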
Crash dump collection: ProcDump, LocalDumps, and managed tools
Capturing a crash dump before or as part of a kill is essential to root cause failures. Recommended tools:
- ProcDump (Sysinternals) — the de facto tool for on-demand dumps. Use procdump -ma <PID> <path> to create a full-memory dump.
- Windows Error Reporting (WER) LocalDumps — configure via registry to automatically create dumps on unhandled exceptions for specific processes (a registry sketch follows this list).
- dotnet-dump and dotnet-counters — for managed .NET apps, useful alongside ProcDump.
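If you also want WER to capture dumps whenever the target crashes on its own, not only when the runner kills it, LocalDumps can be enabled per executable. A minimal sketch, assuming the target binary is MyApp.exe and dumps land in C:\dumps; run it elevated:
# Enable WER LocalDumps for a specific executable
$key = 'HKLM:\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps\MyApp.exe'
New-Item -Path $key -Force | Out-Null
New-ItemProperty -Path $key -Name 'DumpFolder' -PropertyType ExpandString -Value 'C:\dumps' -Force | Out-Null
New-ItemProperty -Path $key -Name 'DumpType'   -PropertyType DWord -Value 2  -Force | Out-Null   # 2 = full dump
New-ItemProperty -Path $key -Name 'DumpCount'  -PropertyType DWord -Value 10 -Force | Out-Null   # keep at most 10 dumps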
Sample safe sequence (PowerShell): capture a live full dump, then terminate:
# Accept the Sysinternals EULA once and capture a full dump before the kill.
# $target is the process object selected for the experiment; avoid the automatic
# $pid variable, which holds the current PowerShell session's own PID.
procdump -accepteula -ma $target.Id "C:\dumps\$($target.ProcessName)_beforekill.dmp"
Stop-Process -Id $target.Id -Force
For production safety, write ProcDump output to a dedicated dump directory with a retention policy, and ship dumps to an analysis server or cloud storage. See our notes on observability pipelines and incident response for ideas about retention and shipping.
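ProcDump does not manage retention on its own, so pair it with a scheduled copy-and-prune task. A minimal sketch, assuming a 7-day local retention window and an SMB share as the analysis destination (both are assumptions):
# Ship new dumps to an analysis share, then prune local dumps older than 7 days
$dumpDir = 'C:\dumps'
$share   = '\\analysis01\dumps'   # hypothetical destination
Get-ChildItem -Path $dumpDir -Filter *.dmp | ForEach-Object {
    Copy-Item -Path $_.FullName -Destination $share -Force
}
Get-ChildItem -Path $dumpDir -Filter *.dmp |
    Where-Object { $_.LastWriteTime -lt (Get-Date).AddDays(-7) } |
    Remove-Item -Force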
Instrumentation: metrics, traces, and logs
Install and configure the following before running experiments:
- windows_exporter for Prometheus to collect process CPU, memory, handle counts, and restart counts.
- OpenTelemetry .NET and native SDKs to capture traces and spans across services.
- Event Log forwarding and ETW traces for system and application events.
- Application-level health checks (HTTP endpoints, synthetic transactions) reported to your monitoring system; a minimal synthetic-probe sketch follows this list.
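As a starting point for those synthetic checks, a probe that hits a health endpoint and reports the result can run on a schedule during the experiment window. A minimal sketch; the health endpoint and metrics receiver URLs are placeholders:
# Probe a health endpoint and push the result to monitoring (URLs are placeholders)
$healthUrl  = 'http://localhost:5000/health'
$metricsUrl = 'http://monitoring/_synthetic/result'
$sw = [System.Diagnostics.Stopwatch]::StartNew()
try {
    $resp = Invoke-WebRequest -Uri $healthUrl -TimeoutSec 5 -UseBasicParsing
    $ok = ($resp.StatusCode -eq 200)
} catch {
    $ok = $false
}
$sw.Stop()
Invoke-RestMethod -Uri $metricsUrl -Method Post -ContentType 'application/json' `
    -Body (@{ healthy = $ok; latencyMs = $sw.ElapsedMilliseconds; timestamp = (Get-Date).ToString('o') } | ConvertTo-Json)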
Phase 3 — Execute: controlled, staged process roulette
Execution is where the playbook earns its keep. Follow a staged approach:
- Canary: a single non-production host, or one canary VM in production that receives representative traffic.
- Small cohort: 5–10% of the pool, automated with runbook and alerting checks.
- Full fleet: only if prior stages meet success criteria.
Sample PowerShell experiment runner (safe)
Below is a simplified, annotated experiment runner. It shows the essential safety features: a hard blacklist guard, pre-kill dumps, a dry-run mode, and observability hooks.
# experiment.ps1
param(
    [string]$ProcessNamePattern = "MyApp*",
    [switch]$DryRun = $true,
    [int]$MaxTargets = 3
)

# Hard blacklist of system-critical processes that must never be targeted
$blacklist = @('csrss','wininit','services','lsass','smss')

# Ensure the dump directory exists before ProcDump writes to it
New-Item -ItemType Directory -Path 'C:\dumps' -Force | Out-Null

# Resolve the pattern to concrete targets, capped at $MaxTargets
$targets = Get-Process |
    Where-Object { $_.ProcessName -like $ProcessNamePattern } |
    Select-Object -First $MaxTargets

foreach ($p in $targets) {
    if ($blacklist -contains $p.ProcessName.ToLower()) {
        Write-Warning "Skipping system process $($p.ProcessName)"
        continue
    }

    Write-Output "Targeting PID $($p.Id) $($p.ProcessName)"
    if ($DryRun) { continue }

    # Collect a full memory dump before the kill
    & procdump -accepteula -ma $p.Id "C:\dumps\$($p.ProcessName)_$($p.Id)_$(Get-Date -Format yyyyMMddHHmmss).dmp"

    # Notify monitoring that a chaos event is starting (example endpoint)
    Invoke-RestMethod -Uri 'http://monitoring/_chaos/start' -Method Post -ContentType 'application/json' `
        -Body (@{ pid = $p.Id; name = $p.ProcessName } | ConvertTo-Json)

    # Controlled kill, then allow time for recovery behavior to be observed
    Stop-Process -Id $p.Id -Force
    Start-Sleep -Seconds 10

    Invoke-RestMethod -Uri 'http://monitoring/_chaos/end' -Method Post -ContentType 'application/json' `
        -Body (@{ pid = $p.Id; name = $p.ProcessName } | ConvertTo-Json)
}
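To use the runner, do a reviewed dry run first, then re-run with the switch explicitly turned off (the parameters are the ones defined above):
# Dry run: only logs the resolved targets
.\experiment.ps1 -ProcessNamePattern 'MyApp*' -DryRun
# Live run, after the dry-run output has been reviewed by the team
.\experiment.ps1 -ProcessNamePattern 'MyApp*' -DryRun:$false -MaxTargets 1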
Phase 4 — Observe: what to measure (and how to collect it)
Make observability central to the experiment. Measure both system and business-level impact in real time.
Essential metrics
- Process-level: restart count, uptime, CPU, working set, handle counts, thread counts.
- Host-level: CPU steal, memory pressure, paging, disk IO, network errors.
- Application-level: request success rate, p99 latency, error rate, queue length.
- Business-level: transaction throughput, revenue events, checkout success.
- SRE metrics: MTTR, MTBF, and error budget burn during the experiment window (see our operations and tooling playbook for aligning SRE KPIs with toolchains).
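For the error budget burn, a common convention (standard SRE practice, not specific to this playbook) is burn rate = observed error rate ÷ (1 − SLO target). For a 99.9% SLO, a 0.5% error rate during the experiment window is a burn rate of 5: the budget is being consumed five times faster than the SLO allows.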
Telemetry pipeline checklist
- Time-synced clocks (NTP) across hosts.
- High-cardinality tags: host, service-version, experiment-id, canary.
- Alerting rules to auto-abort if critical SLOs break (a minimal polling sketch follows this list).
- Correlation IDs propagated in traces so a crash event maps to business traces.
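One lightweight way to implement the auto-abort is a check the runner calls between kills: query your metrics backend and stop the experiment if the error rate crosses the abort threshold. A minimal sketch against the Prometheus HTTP query API; the URL, query, and threshold are assumptions for your environment:
# Abort check: query Prometheus for the current error rate and compare against a threshold
function Test-AbortExperiment {
    param(
        [string]$PromUrl   = 'http://prometheus:9090',
        [string]$Query     = 'sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))',
        [double]$Threshold = 0.01   # abort above a 1% error rate
    )
    # GET /api/v1/query; a hashtable -Body on a GET is appended as query-string parameters
    $resp = Invoke-RestMethod -Uri "$PromUrl/api/v1/query" -Method Get -Body @{ query = $Query }
    if (-not $resp.data.result) { return $false }   # no data returned: surface separately, do not abort here
    $value = [double]$resp.data.result[0].value[1]
    return ($value -gt $Threshold)
}

if (Test-AbortExperiment) { Write-Warning 'SLO breach detected, aborting experiment'; exit 1 }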
Phase 5 — Analyze: dumps, traces, and failure modes
After an experiment, analysis is where you convert chaos into improvements. Prioritize as follows:
- Crash dumps and stack traces
- Trace spans for the transaction path
- Time-series metrics for host and process behavior
Quick dump triage workflow
- Locate dump and confirm timestamps align with experiment window.
- Pull symbols from Microsoft public symbol server and your private symbols store.
- For native code use WinDbg/WinDbg Preview; for managed .NET use dotnet-dump or SOS in WinDbg (typical first-pass commands are sketched below).
- Extract thread stacks, exception records, and GC heap state for .NET.
- Document root cause hypothesis and severity (fixable bug, resource leak, poor retry logic, etc.).
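For managed dumps, a typical first pass with dotnet-dump looks like the following (the dump path is a placeholder; the analyzer commands are standard SOS commands):
# Open the dump in the dotnet-dump interactive analyzer
dotnet-dump analyze C:\dumps\MyApp_1234_20260101120000.dmp

# Typical first-pass commands inside the analyzer:
#   threads          list managed threads
#   clrstack -all    managed call stacks for every thread
#   pe               print the last exception on the current thread
#   dumpheap -stat   GC heap summary by type, useful for spotting leaks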
Phase 6 — Remediate, automate, and close the loop
Fixes after experiments fall into three buckets: code fixes, operational controls, and monitoring automation.
- Code: add exception handling, safer shutdown paths, or circuit breakers.
- Operational: add a process supervisor (Service Control Manager recovery options or a watchdog service), and tighten per-service process isolation with Job Objects where needed.
- Monitoring automation: detect repeated crashes and auto-rollback deployments or scale out healthy instances.
Automated recovery examples
Implement quick wins: use the Service Control Manager (SCM) recovery options for services, or a supervisor process that restarts ephemeral worker processes. For containerized Windows workloads, rely on container orchestrator restart policies to reduce manual toil. See also notes on shift-left practices that help teams bake recovery into CI and developer workflows.
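For the SCM recovery option specifically, failure actions can be configured from the command line. A minimal sketch, assuming a service named MyAppService (the name and delays are illustrative); run elevated:
# Configure automatic restart on failure for a Windows service
# reset= seconds before the failure counter resets; actions= action/delay-in-ms triples
sc.exe failure MyAppService reset= 86400 actions= restart/5000/restart/10000/restart/30000

# Verify the configured failure actions
sc.exe qfailure MyAppService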
Advanced strategies and 2026 trends to adopt
As of 2026, the best teams combine process roulette with these strategies:
- Shift-left chaos: run reduced-scope process-kill tests in CI pipelines for PRs that touch resilience-sensitive code.
- Observability-as-code: define required metrics/traces as part of the test definition so experiments fail if telemetry is missing.
- OpenTelemetry on Windows: instrument both native and managed flows to get end-to-end traces across mixed stacks.
- Policy-driven blast radius: use policies (Rego or built-in Gremlin policies) to auto-enforce never-kill lists and rollback triggers.
Real-world example (short case study)
At a global e-commerce company in 2025, a single search front-end process would sometimes crash under memory pressure, causing a cascade of timeouts across regional caches. The team adopted a process-roulette playbook: they ran controlled kills in a canary region, captured dumps with ProcDump, and found an unmanaged memory leak in native image generation. The fix reduced crash frequency by 90% and improved MTTR via an automated supervisor that restarted workers and pushed an alert with the dump URL into the SRE on-call workflow.
"The combination of pre-kill dumps and canary experiments gave us actionable evidence to ship a targeted fix in two sprints instead of chasing noisy stack traces for months."
Runbook template: experiment checklist
Use this minimal checklist before hitting RUN:
- Document hypothesis and SLOs.
- Pick canary hosts and deploy instrumentation agents.
- Verify dumps are enabled and shipping (run a quick procdump).
- Confirm blacklist of system-critical processes.
- Set automated abort thresholds in monitoring.
- Dry-run the list of targeted PIDs/names and verify via team review.
- Schedule a postmortem slot and incident owner.
Security, legal, and compliance considerations
Ensure you have buy-in from security and compliance teams before running experiments that touch logs, crash dumps (which can contain PII), or endpoint protection. Always encrypt dumps in transit and at rest, redact PII from dumps when required, and maintain an access control policy for who can download and analyze crash dumps.
Common pitfalls and how to avoid them
- Targeting the wrong process — mitigate with dry-run and detailed name-to-PID resolution.
- Insufficient observability — require instrumentation before experiments are allowed to run.
- Missing symbols — maintain a private symbol server and document symbol collection steps.
- Blast radius too large — always start tiny and have auto-abort on SLO breach.
Key takeaways
- Process roulette is a high-value, low-cost experiment for Windows stacks when done safely.
- Pre-kill dumps (ProcDump / LocalDumps) are non-negotiable — they turn transient crashes into actionable fixes.
- Observability and SLO-based aborts make experiments safe to run in canary production.
- Automate the loop: detection, auto-collect dumps, restart, and a postmortem pipeline to close gaps.
Next steps: ready-made checklist to start your first experiment
- Install ProcDump and windows_exporter on a canary VM.
- Configure LocalDumps for your target process and test dump creation.
- Create the PowerShell runner with dry-run enabled and a conservative blacklist.
- Define SLOs and auto-abort thresholds, then run a single canary experiment.
- Analyze dumps and integrate fixes into your sprint backlog.
Call to action
Ready to reduce downtime by proving failure modes before they hit customers? Start with one canary experiment this week: enable ProcDump, instrument with OpenTelemetry, and run a dry-run process roulette to validate targets. If you want, download our reproducible PowerShell runner and checklist to get started (link in the engineering repo) — then iterate, measure, and harden. Share results with your team and make process roulette part of your regular resilience cadence.