Normally health checks are tight and need to respond very quickly. If I curl the site while it's doing heavy work at the moment (moving files between servers or something), it may not respond in time and then it gets killed in the middle of moving files, and that's even worse... I was getting partial files and a bunch of pod restarts just from curling the root of the site.
Right now I'm polling /api/system/status?apikey= as this seems to be the lowest-overhead and most consistent response, but it needs an API key, so there's a chicken-and-egg problem with deploying this: I don't know the API key until after it's deployed and I create one... A simple PONG response is how most workloads are configured; the overhead is so small that even something like an ARM edge node doing heavy work can still respond within the second or so it has before the health check fails.
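For reference, this is roughly how the probe is wired up today. The port, timings, and key value below are illustrative placeholders, not a recommendation; the point is that the key has to exist before the manifest can reference it:

```yaml
# Illustrative livenessProbe for a Sonarr container.
# The apikey value only exists after the first deploy, which is the
# chicken-and-egg problem described above.
livenessProbe:
  httpGet:
    path: /api/system/status?apikey=<key-created-after-first-deploy>
    port: 8989
  initialDelaySeconds: 30
  periodSeconds: 30
  # generous timeout so heavy I/O doesn't get the pod killed mid-transfer
  timeoutSeconds: 5
  failureThreshold: 3
```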
The logic on these health checks is pretty minimal/simple: anything other than a 200 means it's in bad health. I investigated the existing health check endpoint, and it responds with a 200 even when it's unhealthy, so it requires further processing of the returned data to determine health instead of simply using the response codes that anything else would use. In a k8s environment, if Sonarr wants to restart itself, whether for a pending update or a deeper issue like an NFS share being gone, it can simply serve a non-200 response to that health check and the scheduler will punt it out and start up a new instance.
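To make the contrast concrete, here's a minimal sketch of the two styles of check. The function names and the JSON shape are illustrative assumptions on my part, not Sonarr's actual schema:

```python
# Sketch contrasting body-parsing health checks with status-code checks.
# The "health" field below is a hypothetical payload shape, not Sonarr's.

def body_parsing_check(status_code: int, body: dict) -> bool:
    """Current situation: the endpoint returns 200 even when unhealthy,
    so the caller has to dig into the response payload to find out."""
    if status_code != 200:
        return False
    return body.get("health", "ok") == "ok"

def status_code_check(status_code: int) -> bool:
    """What k8s probes (and most load balancers) expect:
    200 = healthy, anything else = kill and replace the pod."""
    return status_code == 200

# An unhealthy instance that still answers 200 with details in the body:
print(body_parsing_check(200, {"health": "nfs share missing"}))  # False
print(status_code_check(200))   # True -- the scheduler can't see the problem

# With a status-code-based endpoint, the same condition maps to e.g. a 503:
print(status_code_check(503))   # False -- the scheduler restarts the pod
```

The second style is what makes the probe configuration a one-liner: the scheduler never has to understand the payload at all.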
Cheers