Normally health checks are tight and need to respond very quickly. If I curl the site while it's doing heavy work at the moment (moving files between servers or something), it may not respond in time and then it gets killed in the middle of moving files, and that's even worse... I was getting partial files and a bunch of pod restarts just from curling the root of the site.
Right now I'm polling /api/system/status?apikey= as this seems to be the lowest-overhead and most consistent response, but it needs an API key, so there's a chicken-and-egg problem with deploying this: I don't know the API key until after it's deployed and I create one... A simple PONG response is how most workloads are configured; the overhead is so small that even something like an ARM edge node doing heavy work can still respond within the second or so it has before the health check fails.
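For reference, this is roughly how the probe is wired up today. The port, timings, and key value below are illustrative placeholders, not a recommendation; the point is that the key has to exist before the manifest can reference it:

```yaml
# Illustrative livenessProbe for a Sonarr container.
# The apikey value only exists after the first deploy, which is the
# chicken-and-egg problem described above.
livenessProbe:
  httpGet:
    path: /api/system/status?apikey=<key-created-after-first-deploy>
    port: 8989
  initialDelaySeconds: 30
  periodSeconds: 30
  # generous timeout so heavy I/O doesn't get the pod killed mid-transfer
  timeoutSeconds: 5
  failureThreshold: 3
```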
The logic on these health checks is pretty minimal/simple: anything other than a 200 means it's in bad health. I investigated the existing health check endpoint, and it responds with a 200 even when it's unhealthy, so it requires further processing of the returned data to determine health instead of simply using the response codes that anything else would use. In a k8s environment, if Sonarr wants to restart itself, whether for a pending update or a deeper issue like an NFS share being gone, it can simply serve a non-200 response to that health check and the scheduler will punt it out and start up a new instance.
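To make the contrast concrete, here's a minimal sketch of the two styles of check. The function names and the JSON shape are illustrative assumptions on my part, not Sonarr's actual schema:

```python
# Sketch contrasting body-parsing health checks with status-code checks.
# The "health" field below is a hypothetical payload shape, not Sonarr's.

def body_parsing_check(status_code: int, body: dict) -> bool:
    """Current situation: the endpoint returns 200 even when unhealthy,
    so the caller has to dig into the response payload to find out."""
    if status_code != 200:
        return False
    return body.get("health", "ok") == "ok"

def status_code_check(status_code: int) -> bool:
    """What k8s probes (and most load balancers) expect:
    200 = healthy, anything else = kill and replace the pod."""
    return status_code == 200

# An unhealthy instance that still answers 200 with details in the body:
print(body_parsing_check(200, {"health": "nfs share missing"}))  # False
print(status_code_check(200))   # True -- the scheduler can't see the problem

# With a status-code-based endpoint, the same condition maps to e.g. a 503:
print(status_code_check(503))   # False -- the scheduler restarts the pod
```

The second style is what makes the probe configuration a one-liner: the scheduler never has to understand the payload at all.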
Cheers