Create a Windows Service Watchdog Instead of Letting Random Killers Crash Your Systems

windows
2026-02-01 12:00:00

Prevent accidental or malicious process kills: implement SCM recovery, NSSM wrappers, and PowerShell watchdogs for resilient Windows services.

Stop random killers from wrecking uptime — design a Windows service watchdog that actually works

When a rogue process-killer or a careless admin runs taskkill on a production box, you shouldn't have to rebuild state and replay logs. In 2026, endpoints are more automated and more targeted than ever: AI-driven scripts, aggressive chaos tools, and misconfigured maintenance jobs can all terminate critical processes. The right answer is a resilient, Windows-native watchdog strategy that uses the Service Control Manager (SCM), NSSM (Non‑Sucking Service Manager), scheduled tasks, and PowerShell to detect failures, restart cleanly, and raise alarms, without turning your server into a crash-happy hairball.

What you'll get from this guide

  • Actionable patterns to make apps behave like proper Windows services
  • Practical examples: sc.exe recovery settings, NSSM wrappers, PowerShell watchdog script, and Task Scheduler triggers
  • Hardening tips to reduce accidental or malicious process termination
  • Production-ready strategies: backoff, flapping protection, health checks, and alerting integration

Linux admins often rely on systemd’s watchdog and restart policies; Windows has equivalent, battle‑tested tools, but they’re spread across SCM, third‑party wrappers, and automation platforms. Since late 2024 and into 2025 we’ve seen two important trends that affect how we design watchdogs:

  • Endpoint tooling and policy engines (Intune, Microsoft Defender for Endpoint, WDAC/AppLocker) have become more central to preventing unauthorized tools — but they’re not perfect. Process‑killing tools (some experimental or hobbyist tools have been covered in mainstream press) still appear in enterprise environments, especially in labs or misconfigured endpoints.
  • Observability and automation matured into AIOps workflows by 2026 — alerts can now automatically create remediation actions if you provide reliable restart semantics and health signals from your services.
Designing a watchdog today means thinking beyond "restart on crash" — add health probes, rate‑limit restarts, and route notifications into your observability pipeline.

Architectural patterns for a robust Windows watchdog

Choose one or combine multiple patterns depending on your application and risk profile:

  • Native SCM service — If you can run your app as a true Windows service (compiled with service entry point or run via srvany/NSSM), configure SCM recovery actions to auto-restart and log failures.
  • Service wrapper (NSSM) — For legacy apps or console programs, use NSSM to wrap the process and provide built‑in restart and stdout/stderr capture.
  • Scheduled-task watchdog — Use Task Scheduler with event triggers (service failed, process exit) to execute a PowerShell remediation script.
  • External monitor + remediation: Use a central controller (Windows Event Forwarding, self-hosted controllers, AIOps) to watch health telemetry and call recovery actions via WinRM over HTTPS.

1) Configure Service Control Manager recovery (fast, native)

If your application can run as a Windows service, always start with SCM recovery options. After a failure the SCM can restart the service, run a command, or reboot the machine, with a configurable delay before each action.

Set up automatic restart using sc.exe

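Use sc.exe to set failure actions. The syntax is terse but powerful, and the space after each = is required. From PowerShell, type sc.exe rather than sc, because sc is an alias for Set-Content in Windows PowerShell. Example: restart three times with a 5-second delay before each attempt.

sc.exe failure "MyService" reset= 86400 actions= restart/5000/restart/5000/restart/5000

To run a command once the restarts are used up (for example, a script that sends alerts), add a run action and supply the program with command=. The script path below is a placeholder:

sc.exe failure "MyService" reset= 86400 command= "powershell.exe -NoProfile -File C:\scripts\send_alert.ps1" actions= restart/5000/restart/5000/restart/5000/run/60000
sc.exe failureflag "MyService" 1
sc.exe qfailure "MyService"

Notes:

  • reset= is the failure-free interval (in seconds) after which the failure count returns to zero.
  • actions= takes action/delay pairs separated by /. The action is restart, run, or reboot, and the delay is in milliseconds. A run action needs a command= value.
  • By default the SCM applies these actions only when the service process exits without reporting a stopped state, which is what a kill looks like. failureflag 1 extends them to stops that report an error code, and qfailure prints the stored settings so you can verify them.
  • Recovery is carried out by the SCM itself (services.exe) after the configured delay, not by a script or agent that could also be killed, so it's the most reliable first line of defense.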
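If you set recovery on many services, it can help to build the actions= string in PowerShell instead of typing it by hand. The following is a minimal sketch under that assumption. Format-ScFailureActions and Set-ServiceRecovery are hypothetical helper names, not part of any Microsoft module, and every action shares one delay for simplicity.

```powershell
# Sketch: build the actions= string for "sc.exe failure" and apply it.
# Both function names are hypothetical helpers invented for this example.

function Format-ScFailureActions {
    param(
        [Parameter(Mandatory)]
        [ValidateSet('restart', 'run', 'reboot')]
        [string[]]$Action,
        [int]$DelayMs = 5000
    )
    # sc.exe expects action/delay pairs joined by '/', with delays in milliseconds.
    ($Action | ForEach-Object { "$_/$DelayMs" }) -join '/'
}

function Set-ServiceRecovery {
    param(
        [Parameter(Mandatory)][string]$ServiceName,
        [string[]]$Action = @('restart', 'restart', 'restart'),
        [int]$DelayMs = 5000,
        [int]$ResetSeconds = 86400
    )
    $actions = Format-ScFailureActions -Action $Action -DelayMs $DelayMs
    # Call sc.exe explicitly: in Windows PowerShell, "sc" is an alias for Set-Content.
    & sc.exe failure $ServiceName reset= $ResetSeconds actions= $actions
    if ($LASTEXITCODE -ne 0) { throw "sc.exe failure exited with code $LASTEXITCODE" }
}

# Usage (elevated prompt on the target host):
# Set-ServiceRecovery -ServiceName 'MyService'
```

Keeping the string builder separate from the sc.exe call means the formatting can be checked without touching a real service.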

Use the Services MMC when you need a GUI

Services (services.msc) → right‑click your service → Recovery tab. Configure First/Second/Subsequent failures, choose Restart the Service, and pick Run a Program for any failure that should launch a script. This covers the same settings as sc.exe failure and is good for one‑off configuration.

2) NSSM — reliably wrap non‑service binaries

NSSM (Non‑Sucking Service Manager) remains a pragmatic tool in 2026 for turning GUI/console apps into proper services with stdout/stderr capture and restart handling. It's still widely used in ops shops where refactoring into a service isn't feasible.

Install and configure NSSM

:: place nssm.exe on the PATH, or call it by its full path (C:\tools\nssm\nssm.exe)
:: set the application path and arguments via the NSSM GUI or the CLI
nssm install MyAppService "C:\Apps\myapp.exe"
nssm set MyAppService AppParameters "--serve --port 8080"
nssm set MyAppService AppDirectory C:\Apps
nssm set MyAppService AppStdout C:\Logs\myapp.out.log
nssm set MyAppService AppStderr C:\Logs\myapp.err.log
:: restart the app whenever it exits, after a 5-second pause
nssm set MyAppService AppExit Default Restart
nssm set MyAppService AppRestartDelay 5000
nssm start MyAppService

NSSM supports advanced options: AppExit handlers, environment variables, and graceful stop timeouts. Use NSSM when you need to capture logs reliably and want a service wrapper without rewriting the app.

3) PowerShell watchdog — flexible, policyable, and automatable

Wrap monitoring logic in a lightweight script that runs as a service (via NSSM) or as a scheduled task. This lets you implement health probes (HTTP/TCP), exponential backoff, flapping detection and integration with modern alerting APIs (Teams, PagerDuty, SMTP, or an observability endpoint).

Core PowerShell watchdog (simple, production‑ready)

# C:\scripts\watchdog.ps1
param(
  [string]$ProcessName = 'myapp',              # name as Get-Process reports it (no .exe)
  [string]$ExePath = 'C:\Apps\myapp.exe',
  [int]$MaxRestartsPerHour = 5,
  [int]$RestartDelaySeconds = 5,
  [string]$HealthUrl = 'http://localhost:8080/health'
)

function Send-Alert([string]$msg) {
  # Simple SMTP or webhook integration placeholder.
  # Write-EventLog ships with Windows PowerShell 5.1. Register the source once from an
  # elevated prompt: New-EventLog -LogName Application -Source Watchdog
  Write-EventLog -LogName Application -Source Watchdog -EntryType Warning -EventId 5000 -Message $msg
}

# Timestamps of recent restarts, used for flapping detection
$history = [System.Collections.Generic.Queue[datetime]]::new()

while ($true) {
  $p = Get-Process -Name $ProcessName -ErrorAction SilentlyContinue
  $healthy = $false

  if ($p) {
    try {
      $r = Invoke-WebRequest -Uri $HealthUrl -UseBasicParsing -TimeoutSec 3
      if ($r.StatusCode -eq 200) { $healthy = $true }
    } catch { $healthy = $false }
  }

  if (-not $p -or -not $healthy) {
    # drop restart records older than one hour
    $now = Get-Date
    while ($history.Count -gt 0 -and ($now - $history.Peek()).TotalHours -ge 1) { [void]$history.Dequeue() }

    if ($history.Count -ge $MaxRestartsPerHour) {
      Send-Alert "Service $ProcessName is flapping: restart threshold exceeded"
      Start-Sleep -Seconds 300
      continue
    }

    $history.Enqueue($now)
    Write-Output "[Watchdog] Restarting $ProcessName"

    # attempt restart
    try {
      # stop a running-but-unhealthy instance so two copies never overlap
      if ($p) { $p | Stop-Process -Force -ErrorAction SilentlyContinue }
      Start-Process -FilePath $ExePath -WorkingDirectory (Split-Path $ExePath) -ArgumentList '--serve' -NoNewWindow -ErrorAction Stop
      Start-Sleep -Seconds $RestartDelaySeconds
    } catch {
      # ${ProcessName} is braced because "$ProcessName:" would be parsed as a drive reference
      Send-Alert "Failed to start ${ProcessName}: $_"
    }
  }

  Start-Sleep -Seconds 10
}

Wrap this script with NSSM or a scheduled task that runs at startup with highest privileges. The script shows key resilience features: health probe, restart rate limit (flapping protection), and alerting hook points.
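If you go the scheduled-task route, the registration itself can be scripted with the built-in ScheduledTasks module. This is a rough sketch, not a definitive setup. The task name AppWatchdog and both function names are hypothetical, and running as SYSTEM is an assumption you should weigh against your least-privilege policy.

```powershell
# Sketch: register watchdog.ps1 as a startup task with the ScheduledTasks module.
# 'AppWatchdog' and both function names are hypothetical; adjust to your environment.

function New-WatchdogArgumentString {
    param([Parameter(Mandatory)][string]$ScriptPath)
    # Quote the path so directories containing spaces survive.
    "-NoProfile -ExecutionPolicy Bypass -File `"$ScriptPath`""
}

function Register-WatchdogTask {
    param(
        [string]$TaskName = 'AppWatchdog',
        [string]$ScriptPath = 'C:\scripts\watchdog.ps1'
    )
    $arguments = New-WatchdogArgumentString -ScriptPath $ScriptPath
    $action    = New-ScheduledTaskAction -Execute 'powershell.exe' -Argument $arguments
    $trigger   = New-ScheduledTaskTrigger -AtStartup
    # Assumption: SYSTEM with highest privileges; swap in a dedicated account if you have one.
    $principal = New-ScheduledTaskPrincipal -UserId 'SYSTEM' -LogonType ServiceAccount -RunLevel Highest
    # Intended to lift the default execution time limit and retry if the script itself dies.
    $settings  = New-ScheduledTaskSettingsSet -ExecutionTimeLimit ([TimeSpan]::Zero) `
                     -RestartCount 3 -RestartInterval (New-TimeSpan -Minutes 1)
    Register-ScheduledTask -TaskName $TaskName -Action $action -Trigger $trigger `
        -Principal $principal -Settings $settings -Force
}

# Usage (elevated prompt):
# Register-WatchdogTask
```

The loop in watchdog.ps1 never exits on its own, so the execution time limit matters: with the default limit, Task Scheduler would stop the watchdog after a few days.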

4) Event-driven Task Scheduler watchdog (no service required)

You can create scheduled tasks triggered by Windows Event Log entries. This is useful when you cannot run a persistent service but want immediate remediation after a failure.

Example: restart on SCM event (7034 / 7031)

  1. Open Task Scheduler → Create Task.
  2. General → select Run whether user is logged on or not, and tick Run with highest privileges.
  3. Triggers → New → Begin the task: On an event → Log: System → Source: Service Control Manager → Event ID: 7034 (service terminated unexpectedly). Add a second trigger for 7031 if the service has SCM recovery actions configured.
  4. Actions → Start a program: powershell.exe with arguments -NoProfile -ExecutionPolicy Bypass -File C:\scripts\svc_remediate.ps1 -ServiceName MyService

This approach runs a remediation action as soon as SCM logs a service termination. A basic event trigger fires for any service that logs the event, so pair it with a small script that confirms the named service is actually down, checks health, and restarts safely.
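One way to narrow the trigger to a single service is an XPath filter on the event data. The sketch below builds such a filter and uses it with Get-WinEvent. It assumes that events 7031 and 7034 carry the service display name in the EventData field named param1, which you should confirm against a real event on your hosts. The function names are hypothetical.

```powershell
# Sketch: XPath filter for SCM failure events that name one specific service.
# Assumption: events 7031/7034 store the service display name in EventData 'param1'.

function New-ScmFailureXPath {
    param([Parameter(Mandatory)][string]$ServiceDisplayName)
    # Display names containing a single quote would need escaping; not handled here.
    "*[System[Provider[@Name='Service Control Manager'] and (EventID=7031 or EventID=7034)]]" +
    " and *[EventData[Data[@Name='param1']='$ServiceDisplayName']]"
}

function Get-RecentServiceFailure {
    param(
        [Parameter(Mandatory)][string]$ServiceDisplayName,
        [int]$MaxEvents = 5
    )
    $xpath = New-ScmFailureXPath -ServiceDisplayName $ServiceDisplayName
    Get-WinEvent -LogName System -FilterXPath $xpath -MaxEvents $MaxEvents -ErrorAction SilentlyContinue
}

# Usage: list the last few unexpected terminations of one service.
# Get-RecentServiceFailure -ServiceDisplayName 'My App Service'
```

The same filter string can be pasted into the XML tab of a custom event trigger, so the task fires for that service only.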

Hardening: stop attackers and accidents from killing your watchdog

Watchdogs are only as good as their ability to run uninterrupted. Include these hardening steps:

  • Ephemeral admin access — follow least privilege; don't run everything as LocalSystem unless necessary.
  • AppLocker/WDAC policies — explicitly allow only approved process management tools; block rogue tools like ad hoc process-killers. In 2025–2026, WDAC has become more widely used in regulated environments for process allowlisting.
  • Tamper protection — ensure EDR/Defender tamper protection is enabled and exclude only when necessary. Use MDE (Microsoft Defender for Endpoint) baseline policies where possible.
  • Registry/ACLs — protect service registry keys and critical scripts with ACLs to prevent unauthorized modification.
  • Detect Taskkill/Stop-Process use: enable process creation and termination auditing (Audit Process Creation and Audit Process Termination, Security events 4688 and 4689) and ship the logs to your SIEM or observability pipeline for alerting.
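As a starting point for the auditing item above, the following sketch scans Security event 4688 for command lines that look like kill tools. It assumes process creation auditing is on and that the "Include command line in process creation events" policy is enabled. Otherwise the CommandLine field is empty. The regex is a rough heuristic, and the function names are hypothetical.

```powershell
# Sketch: flag process-creation events (4688) whose command line looks like a kill tool.
# Assumes command-line capture is enabled in audit policy; the patterns are heuristics only.

function Test-IsKillCommand {
    param([string]$CommandLine)
    # -match is case-insensitive by default. Extend the list for tools seen in your fleet.
    $CommandLine -match '\b(taskkill(\.exe)?|Stop-Process|pskill(64)?(\.exe)?)\b'
}

function Get-KillCommandEvent {
    param([int]$LookbackMinutes = 60)
    $filter = @{
        LogName   = 'Security'
        Id        = 4688
        StartTime = (Get-Date).AddMinutes(-$LookbackMinutes)
    }
    Get-WinEvent -FilterHashtable $filter -ErrorAction SilentlyContinue | ForEach-Object {
        # Read the field by name from the event XML rather than by positional index.
        $data = ([xml]$_.ToXml()).Event.EventData.Data
        $cmd  = ($data | Where-Object { $_.Name -eq 'CommandLine' }).'#text'
        if (Test-IsKillCommand -CommandLine $cmd) {
            [pscustomobject]@{ Time = $_.TimeCreated; CommandLine = $cmd }
        }
    }
}

# Usage (elevated prompt, reads the Security log):
# Get-KillCommandEvent -LookbackMinutes 30
```

This only detects. Blocking the tools is still a job for AppLocker or WDAC.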

Operational patterns: make restarts safe and observable

In production you must avoid cycles of rapid restarts (flapping), unnecessary restarts when external dependencies failed, and silent failures. Implement:

  • Health checks: HTTP readiness and liveness endpoints, port checks, and dependency probes (DB, caches). Tie these probes into your observability dashboards.
  • Exponential backoff and cool‑off: after N restarts increase delay and notify operators.
  • Notifications: Event Log entries plus a webhook to PagerDuty/Teams/Slack. Many observability tools now accept auto‑remediate webhooks, so integrate thoughtfully.
  • Rate limits: limit restarts to avoid cascading outages.
  • Health‑aware restarts: attempt graceful shutdowns, dump diagnostics (proc dumps, traces) before restart and archive them to a secure store (see Zero‑Trust Storage) to speed root cause analysis.
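The exponential backoff item can be sketched as a small pure function. The base delay and cap below are illustrative defaults, not recommendations, and Get-BackoffDelay is a hypothetical name.

```powershell
# Sketch: exponential backoff with a cap. Defaults are illustrative only.

function Get-BackoffDelay {
    param(
        [Parameter(Mandatory)][int]$Attempt,   # 1 for the first restart, 2 for the second, ...
        [int]$BaseSeconds = 5,
        [int]$MaxSeconds = 300
    )
    if ($Attempt -lt 1) { $Attempt = 1 }
    # Double the delay on each attempt, then cap it at MaxSeconds.
    $delay = $BaseSeconds * [math]::Pow(2, $Attempt - 1)
    [int][math]::Min($delay, $MaxSeconds)
}

# Usage inside a watchdog loop, where $attempt counts restarts in the current window:
# Start-Sleep -Seconds (Get-BackoffDelay -Attempt $attempt)
```

With these defaults the delays run 5, 10, 20, 40 seconds and so on, flattening out at 300 seconds.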

Deployment & automation (PowerShell, Intune, GPO, CI/CD)

Make your watchdog reproducible and auditable:

  • Bundle NSSM, service installers, and scripts into an MSIX or package and deploy via Intune or SCCM.
  • Use PowerShell Desired State Configuration (DSC) or a declarative ARM/WinRM pipeline task in your CI/CD to keep service settings (SCM recovery, ACLs) consistent.
  • Example PowerShell to install NSSM service and set SCM recovery using sc.exe in a deployment script:

# Example deploy snippet
New-Item -ItemType Directory -Path C:\tools\nssm -Force | Out-Null
Copy-Item -Path .\nssm.exe -Destination C:\tools\nssm\ -Force
C:\tools\nssm\nssm.exe install MyAppService C:\Apps\myapp.exe --serve
# sc.exe, not sc: in Windows PowerShell, sc is an alias for Set-Content
sc.exe failure "MyAppService" reset= 3600 actions= restart/5000/restart/5000/restart/5000

Advanced: circuit breakers and sidecar monitors

For microservices or critical daemons, consider a sidecar watchdog model (a small service that supervises the main process). This lets you:

  • Perform graceful shutdowns for stateful apps before restart
  • Collect debug artifacts and memory dumps on each failure
  • Implement more sophisticated policy engines (circuit breaker, canary restarts)

In Windows environments, a sidecar can be deployed as its own service (or wrapped with NSSM) and communicate with the main process via named pipes, local TCP, or performance counters.
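A circuit breaker for a sidecar can be as small as a counter and a timestamp. The sketch below is one possible shape, with hypothetical function names and illustrative thresholds. It trips after a set number of failures and refuses further restarts until a cool-down has passed.

```powershell
# Sketch: minimal circuit breaker state for a sidecar watchdog.
# Function names and thresholds are hypothetical.

function New-CircuitBreaker {
    param([int]$FailureThreshold = 3, [int]$CooldownSeconds = 300)
    [pscustomobject]@{
        FailureThreshold = $FailureThreshold
        CooldownSeconds  = $CooldownSeconds
        Failures         = 0
        OpenedAt         = $null
    }
}

function Add-BreakerFailure {
    param($Breaker, [datetime]$Now = (Get-Date))
    $Breaker.Failures++
    if ($Breaker.Failures -ge $Breaker.FailureThreshold -and $null -eq $Breaker.OpenedAt) {
        $Breaker.OpenedAt = $Now   # trip: stop restarting until the cool-down passes
    }
}

function Test-BreakerAllowsRestart {
    param($Breaker, [datetime]$Now = (Get-Date))
    if ($null -eq $Breaker.OpenedAt) { return $true }
    if (($Now - $Breaker.OpenedAt).TotalSeconds -ge $Breaker.CooldownSeconds) {
        # Cool-down elapsed: close the breaker and allow a fresh attempt.
        $Breaker.OpenedAt = $null
        $Breaker.Failures = 0
        return $true
    }
    return $false
}

# Usage:
# $breaker = New-CircuitBreaker
# if (Test-BreakerAllowsRestart -Breaker $breaker) { <# restart the app #> }
# else { <# alert and wait #> }
```

Passing the current time in as a parameter keeps the logic easy to exercise without waiting for real cool-downs.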

Case study: recovering a legacy service from random kills

We inherited a Windows server running a single legacy process that repeatedly died during ad hoc stress tests — sometimes from a faulty admin script that used taskkill recursively. Rewriting the app wasn't feasible. The solution:

  1. Wrapped the binary with NSSM and enabled AppStdout/AppStderr logging.
  2. Configured SCM recovery using sc failure so the service auto‑restarted three times, then ran a script to create a proc dump and create a high‑severity incident ticket.
  3. Deployed a sidecar PowerShell watchdog to perform HTTP health checks, apply exponential backoff, and call our incident API.
  4. Locked down the host with AppLocker so ad hoc process killers were not permitted on production images.

Result: the workload reached 99.95% uptime and blind restarts fell drastically. We only restarted when health checks failed, and a dump was captured for analysis each time.

What to avoid — common pitfalls

  • Don't use RtlSetProcessIsCritical to make a process "critical" — it triggers a blue screen (BSOD) when killed. It's a poor substitute for graceful recovery and causes bigger outages.
  • Avoid silent restarts — always capture diagnostics, log events, and notify operators before or immediately after restart. Archive those artifacts into a secure store (see Zero‑Trust Storage).
  • Don't ignore dependencies — restarting a service that relies on an unreachable database only creates churn. Health checks are required.
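To act on the dependency point, a watchdog can probe the dependency before deciding to restart. Here is a minimal sketch using a TCP connect with a timeout. Test-TcpDependency is a hypothetical name, and the host and port in the usage note are placeholders.

```powershell
# Sketch: TCP reachability probe with a timeout, for use before a restart decision.
# The function name is hypothetical.

function Test-TcpDependency {
    param(
        [Parameter(Mandatory)][string]$HostName,
        [Parameter(Mandatory)][int]$Port,
        [int]$TimeoutMs = 2000
    )
    $client = [System.Net.Sockets.TcpClient]::new()
    try {
        $task = $client.ConnectAsync($HostName, $Port)
        # Wait returns $false on timeout; a refused connection throws and counts as down.
        return ($task.Wait($TimeoutMs) -and $client.Connected)
    } catch {
        return $false
    } finally {
        $client.Dispose()
    }
}

# Usage (placeholder host and port):
# if (-not (Test-TcpDependency -HostName 'db.example.internal' -Port 1433)) {
#     # Dependency is down: alert instead of restarting the service.
# }
```

A TCP connect only shows that the port answers. A real readiness check against the dependency is stronger where one exists.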

Future predictions (2026+): what to prepare for

Expect these trends to shape watchdog design going forward:

  • Policy-driven protection: Intune, Endpoint Configuration Manager and WDAC/AppLocker will be heavily used to prevent rogue process-killers in production fleets.
  • AIOps remediation chains: Observability platforms will increasingly call remediation playbooks. Design your watchdog to provide crisp signals and safe remediation endpoints (see observability playbooks).
  • More third‑party wrappers: Tools like NSSM will continue to be useful, but expect enterprise-grade service managers and containerized Windows workloads to take the lead in greenfield projects.

Checklist — deploy a production Windows watchdog

  1. Decide: native service vs NSSM wrapper vs scheduled task.
  2. Configure SCM recovery (sc failure) for immediate restarts.
  3. Implement a PowerShell watchdog for health checks, backoff, and alerting.
  4. Capture logs and dumps on failures before restarting and archive them in a secure Zero‑Trust Storage.
  5. Protect watchdog processes with AppLocker/WDAC and EDR tamper protection.
  6. Automate deployment via Intune/SCCM/CI pipelines and enforce config drift detection.
  7. Integrate alerts into observability/AIOps workflows for automated or human remediation.

Final takeaways

Random process killers — whether malicious, accidental, or experimental — are a real operational risk. In 2026, the correct response is not a one‑line restart; it’s a layered watchdog strategy: run critical apps as services or wrapped by NSSM, use SCM recovery rules, add health‑aware PowerShell watchers or Task Scheduler triggers, and harden the platform with allowlisting and tamper protection. Combine these with observability and automation pipelines so remediations are safe, auditable, and effective.

Call to action

Ready to harden your Windows services? Start with a two-step project this week: 1) Identify the top 5 processes you need to protect, and 2) deploy an NSSM wrapper or native service with SCM recovery and a PowerShell watchdog. If you want a template, download our production-ready NSSM + PowerShell watchdog bundle from the windows.page GitHub repo and adapt it to your observability stack.


windows

Contributor

Senior editor and content strategist. Writing about technology, design, and the future of digital media. Follow along for deep dives into the industry's moving parts.

2026-01-24T04:56:56.291Z