Problem

Docker containers running various services would fail intermittently - sometimes lasting days, sometimes weeks - in ways that couldn't be caught or resolved without manual restarts. There was no built-in mechanism to detect when a container had entered a bad state and recover it automatically without manual intervention.

Approach

Built a Python service that runs in its own Docker container, watches other containers on the host, and automatically restarts them after a configurable interval. Containers opt in to being watched by setting specific environment variables, keeping the integration lightweight and non-invasive. No changes to existing container definitions are required beyond adding the opt-in variable.

Outcome

After several months of running, the tool eliminated the intermittent service failures that were previously requiring manual intervention. Services that do fail are restarted frequently enough that downtime is negligible - the watcher catches and recovers them before the failure becomes noticeable.

Tech Stack

Python Docker Automation