Alert runbooks
This document contains a collection of runbooks to follow for each alert raised on the platform. Each alert should include a short description and a deep link to the corresponding runbook in this document.
KubePodCannotConnectToInternet
- Is this affecting only the pod network, or the node network too? Run a pod on the host network:
  kubectl run testbox --rm -it --image ubuntu:18.04 --overrides='{ "spec": { "hostNetwork" : true } }' -- /bin/bash
  Then check you can reach the internet:
  apt-get update && apt-get install curl -y
  curl https://www.google.com
  If that fails, check your NAT Gateway. The dashboard can be found under platform-monitoring/NAT Gateway in Grafana.
- Is the Cloud NAT configured correctly?
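If you have gcloud access, the Cloud NAT configuration can also be inspected from the CLI. A minimal sketch, where <nat-config-name>, <router-name> and <region> are placeholders for your environment (the real names depend on how the gateway was provisioned):
# Show the Cloud NAT configuration: IP allocation mode, min ports per VM, subnet ranges
gcloud compute routers nats describe <nat-config-name> --router=<router-name> --region=<region>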
KubePodCannotConnectToDeveloperPortal
Fires when the blackbox exporter is unable to connect to the developer portal.
- Is the developer portal running?
  kubectl -n developer-portal get all
- Are other alerts such as KubePodCannotConnectToInternet or ContainerInErrorState firing?
- What is preventing the pod from running?
  kubectl -n developer-portal logs deployment.apps/developer-portal
- Are there errors relating to the database?
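One way to spot database-related errors is to filter the portal logs. This is only a sketch: the grep pattern is an assumption about what the error messages might look like, not something defined by the portal itself.
# Search the last hour of developer portal logs for likely database errors
kubectl -n developer-portal logs deployment.apps/developer-portal --since=1h | grep -iE 'database|sql|connection|timeout'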
ClusterAutoscalerNoScaleUp
Node auto-provisioning did not provision any node pool for the pending pod because doing so would violate resource limits.
For GCP/GKE, check logs to get more details:
resource.type="k8s_cluster" AND
log_id("container.googleapis.com/cluster-autoscaler-visibility") AND
( "noScaleUp" )
Review and update cluster-wide minimal resource limits set for cluster auto-scaler.
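If you have gcloud access, the current limits can be inspected and raised from the CLI. A sketch with placeholder <cluster-name> and <region> values and example limits only; pick numbers that match your capacity planning:
# Show the current node auto-provisioning / autoscaler resource limits
gcloud container clusters describe <cluster-name> --region=<region> --format="yaml(autoscaling)"
# Raise the cluster-wide limits (example values)
gcloud container clusters update <cluster-name> --region=<region> --enable-autoprovisioning --min-cpu=1 --max-cpu=64 --min-memory=1 --max-memory=256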
KubeHpaReplicasMismatch
Horizontal Pod Autoscaler has not matched the desired number of replicas for longer than 15 minutes. HPA was unable to schedule the desired number of pods.
Check why HPA was unable to scale:
- not enough nodes in the cluster
- hitting resource quotas in the cluster
- pods evicted due to pod priority
If using the cluster autoscaler, you may need to set up preemptive (placeholder) pod pools to ensure nodes are created on time.
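A generic starting point for the checks above, with <namespace> and <hpa-name> as placeholders:
# Inspect the HPA status, conditions and recent scaling events
kubectl -n <namespace> describe hpa <hpa-name>
# Look for scheduling failures, evictions or preemptions in the namespace
kubectl -n <namespace> get events --sort-by=.lastTimestamp | grep -E 'FailedScheduling|Evicted|Preempt'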
KubeHpaMaxedOut
Horizontal Pod Autoscaler (HPA) has been running at max replicas for longer than 15 minutes. HPA won’t be able to add new pods and thus scale the application.
Note: for some services, running at max replicas is in fact desired.
Check why HPA was unable to scale:
- max replicas set too low
- resource requests (e.g. CPU) set too low
If using basic metrics like CPU/memory, ensure proper values are set for requests.
For memory-based scaling, ensure there are no memory leaks.
If using custom metrics, fine-tune how the app scales according to them.
Use performance tests to see how the app scales.
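As a quick first check, compare the current replica count, the configured maximum and the live resource usage (placeholders are assumptions; kubectl top requires metrics-server or an equivalent metrics API):
# Current replicas vs. min/max and metric targets
kubectl -n <namespace> get hpa <hpa-name>
# Live usage to compare against the configured requests
kubectl -n <namespace> top pods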
ContainerInErrorState
A container is not starting up and is stuck in a waiting state.
- Identify which pod is causing the issue in Grafana:
https://<grafana_url>/explore?orgId=1&left=%7B%22datasource%22:%22gmp-datasource%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22editorMode%22:%22code%22,%22expr%22:%22sum%28kube_pod_container_status_waiting_reason%7Breason%20%21%3D%5C%22ContainerCreating%5C%22%7D%29%20by%20%28reason,%20pod%29%20%3E%200%22,%22legendFormat%22:%22__auto%22,%22range%22:true,%22instant%22:true%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D)
- What is preventing the pod from starting? Is the container in CrashLoopBackOff? Check the pod events:
  kubectl -n <pod_namespace> describe pod <pod_name>
  If in CrashLoopBackOff state, check that the process within the container is correctly configured. More info on debugging can be found in the GKE docs.
- Is the image being pulled correctly? Check namespace events:
  kubectl -n <pod_namespace> get events --sort-by=.lastTimestamp
  If in ErrImagePull or ImagePullBackOff, check that the image name is configured correctly and the tag exists in the registry. More info on debugging can be found in the GKE docs.
- Can the pod be scheduled? Check the requests/limits on the container and ensure there is enough capacity in the cluster. More info on debugging can be found in the GKE docs.
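A rough, generic way to check cluster capacity and recent scheduling failures (not specific to this platform):
# Per-node summary of requested vs. allocatable resources
kubectl describe nodes | grep -A 10 'Allocated resources'
# Recent scheduling failures across all namespaces
kubectl get events -A --field-selector reason=FailedScheduling --sort-by=.lastTimestamp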
NatGatewayHighPortUtilisation
Meaning
High port utilisation by the NAT Gateway: port allocation has reached 70%. Each external IP address provides 64,512 available ports that are shared by all VMs. Each port corresponds to a connection to a unique destination (IP:PORT:PROTOCOL). When the NAT Gateway runs out of free ports, it will start dropping outbound packets (requests going out to the internet).
Impact
No outbound requests are affected at this point; however, you’re getting closer to the limit. Once utilisation gets closer to 100%, some outbound requests will be affected.
Utilisation doesn’t have to reach 100% for requests to be affected. The NAT Gateway will try to allocate at least the number of ports specified in the network.publicNatGateway.minPortsPerVm configuration. If there are not enough ports available to satisfy this value, no ports will be allocated to the VM in need.
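As a rough worked example (1,024 is only an assumed value for minPortsPerVm, not the platform default): with minPortsPerVm set to 1,024, a single NAT IP can serve at most 64,512 / 1,024 = 63 VMs, so utilisation grows with both the number of VMs and the number of distinct destinations each VM connects to.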
Diagnosis & Mitigation
Follow the NAT Gateway IP Allocation Failures section.
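If you have gcloud access, the per-VM port allocation can also be inspected directly. A sketch with placeholder <router-name> and <region> values:
# Show which NAT IPs and port ranges are currently allocated to each VM
gcloud compute routers get-nat-mapping-info <router-name> --region=<region>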
NatGatewayIpAllocationFailed
Meaning
Failure in allocating NAT IPs to any VM in the NAT gateway. As a result, services residing on the affected VMs will not be able to reach the internet. The NAT Gateway allocates a single IP to multiple VMs. When there are not enough available NAT source IP addresses and source port tuples (IP:PORT:PROTOCOL), the NAT Gateway won’t be able to service any new outbound connections.
Impact
Some current outbound requests are affected.
Diagnosis & Mitigation
Follow the NAT Gateway IP Allocation Failures section.
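If the gateway is genuinely short of IPs, one option is to attach more NAT IPs or switch to auto-allocation. Whether this is done with gcloud or through the environment's network.publicNatGateway configuration depends on how the gateway is managed, so treat this as a sketch with placeholder names:
# Check the current IP allocation mode and attached NAT IPs
gcloud compute routers nats describe <nat-config-name> --router=<router-name> --region=<region>
# Example: let GCP auto-allocate additional NAT IPs
gcloud compute routers nats update <nat-config-name> --router=<router-name> --region=<region> --auto-allocate-nat-external-ips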
ArgoCDApplicationIsNotHealthy
Meaning
One of the application components is not healthy, resulting in the whole application being marked as unhealthy.
Impact
Tenant resources are managed as ArgoCD applications, meaning that part of the tenant functionality might not be working as expected. For example, some of the tenant permissions might be missing.
Diagnosis & Mitigation
- Identify the problematic component by checking the ArgoCD UI: https://argocd.{{ internalServices.domain }}
- Act according to the error:
  - If the error is related to environment configuration, fix the configuration.
  - If the errors can be fixed by re-applying the manifests, you can do so by running a Sync operation for the resource with the Replace and Force options. Here is the How-to on how to do it.
  - If it doesn’t help, contact the Core Platform team.
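If you prefer the CLI over the UI, the same Sync with Replace and Force can be run with the argocd CLI. A sketch assuming you are logged in and <app-name> is the affected application:
# Re-apply the manifests, replacing resources that cannot be patched in place
argocd app sync <app-name> --force --replace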
ArgoCDApplicationIsNotSynced
Meaning
The application components’ state differs from the desired state.
Impact
Tenant resources are managed as ArgoCD applications, meaning that part of the tenant functionality might not be working as expected. For example, some of the tenant permissions might be missing.
Diagnosis & Mitigation
- Identify the problematic component by checking the ArgoCD UI: https://argocd.{{ internalServices.domain }}
- If the resource can’t be updated because it’s immutable, it’s most probably a bug and should be fixed by the Core Platform team. But you can try a quick fix by syncing it with the Replace and Force options via the ArgoCD UI. Here is the How-to on how to do it.
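Before syncing, it can help to see what is actually drifting. A sketch using the argocd CLI, assuming you are logged in and <app-name> is the affected application:
# Show the difference between the live state and the desired (Git) state
argocd app diff <app-name>
# Show the overall sync/health status and the individual resources
argocd app get <app-name>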
NoIapAuthOnInternalServiceDomain
Meaning
Some or all services on the platform’s designated internal services domain can be accessed without authentication. For more information on the platform’s authenticated endpoints, see the Platform ingress and Internal services pages. This alert is triggered by accessing the podinfo svc on https://ready.{{ internalServices.domain }} as an unauthenticated user and receiving a 200 response code (a 302 redirect to Google auth is expected).
Impact
Potentially, we’re exposing internal endpoints on the public network that can be accessed by anyone. This is a major incident and should be dealt with immediately.
Diagnosis & Mitigation
- Validate that you can access the internal endpoint without authentication:
  curl -v https://ready.{{ internalServices.domain }}
  You’ll get a 200 status code if you can; otherwise you’ll get a 302 redirect to the Google auth endpoint.
- Go to the IAP configuration in the GCP console and select the project.
- You’ll see two backend services (one for each k8s traefik svc). Ensure the backend service with a name containing platform-ingress-traefik-iap has IAP enabled. The other should have IAP disabled. (A gcloud sketch for checking this from the CLI is included at the end of this runbook.)
- Ensure there are no errors in the Status column on either backend service. In case of errors, re-enable IAP on the affected backend service and re-test.
- Validate the environment Gateway configuration:
  kubectl get httproute -n platform-ingress
  Check that the HOSTNAMES are correctly configured: {{ ingressDomains.domain }} should point to the IAP-disabled backend service (traefik), and {{ internalServices.domain }} should point to the IAP-enabled one (traefik-iap).
  kubectl get gateway -n platform-ingress -o yaml
  Check that the allowedRoutes are correctly configured according to the httpRoutes.
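As referenced in the backend services step above, the IAP state of both backend services can also be checked from the CLI. A sketch assuming gcloud access to the environment's project:
# List backend services and whether IAP is enabled on each
gcloud compute backend-services list --format="table(name, iap.enabled)"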