Platform Operations

Accessing the Platform

This documentation guides platform users on accessing the CECG Kubernetes platform on Google Cloud Platform (GCP).

Prerequisites

Platform Readonly Group

The platform readonly group is used to specify which users in your organisation are authorised to generate a kubeconfig allowing them to execute kubectl commands against the cluster. This group DOES NOT authorise members to modify any resources in the cluster.

The group should be: platform-readonly@domain e.g. platform-readonly@cecg.io if your domain is cecg.io

GKE Security Group

We leverage GKE’s native support for using Google Groups to support RBAC within Core Platform clusters (see GKE Documentation). A requirement of this native support is a Google Group named gke-security-groups@domain to act as a top level group for nested team based groups.

This group should only contain other google groups as members, not users or service accounts (the user who creates the group will be an owner).

Group Member Visibility

The GKE + Google Groups native integration requires an additional configuration change to the gke-security-groups group. This group, and all nested groups within it, must have their Group Settings -> Who can view members configuration set to Group members, as shown here:

GCP Group Settings

Cluster Access

Using corectl

corectl env connect <environment-name>

Manually

  • Start IAP Tunnel:
gcloud compute start-iap-tunnel ${BASTION_HOSTNAME} 3128 \
    --local-host-port localhost:${LOCAL_PROXY_PORT} \
    --project ${PROJECT_ID} \
    --zone ${BASTION_ZONE}
  • Access the Cluster with kubectl:
# Install the gke-gcloud-auth-plugin binary (required)
gcloud components install gke-gcloud-auth-plugin

# Update kubectl configuration
gcloud container clusters get-credentials ${CLUSTER_NAME} \
    --project ${PROJECT_ID} \
    --region=${COMPUTE_REGION}

# Update kubectl configuration to use the bastion as a proxy
kubectl config set clusters."$(kubectl config current-context)".proxy-url http://localhost:${LOCAL_PROXY_PORT}

# Verify user access to resources
kubectl auth can-i get pods \
    --namespace=${NAMESPACE} \
    --as=${USER}@${DOMAIN} \
    --as-group=${GROUP}@${DOMAIN}

💡 Tip: For a full example, check the Environments Repo.

GCP Registry Access

  • To access artifact registries, use the following command to configure Docker:
gcloud auth configure-docker ${GCP_REGION}-docker.pkg.dev
  • Now, you have read and write access to ${GCP_REGION}-docker.pkg.dev/${PROJECT_ID}/tenant.
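For example, a minimal sketch of tagging and pushing an image into that repository (my-app and its tag are placeholder names, not a shipped platform image):

# Assumes gcloud auth configure-docker has been run as above; my-app is a placeholder
docker tag my-app:latest ${GCP_REGION}-docker.pkg.dev/${PROJECT_ID}/tenant/my-app:latest
docker push ${GCP_REGION}-docker.pkg.dev/${PROJECT_ID}/tenant/my-app:latest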

Subsections of Platform Operations

Cluster Autoscaling

The platform supports running a wide range of workloads while keeping its base cost low. It provides autoscaling capabilities to adjust the capacity of the cluster to handle fluctuations in traffic. When traffic goes down, the cluster scales back down to its normal size.

When running our platform on GCP/GKE we choose to operate clusters in Standard mode. You can read more about pricing for Standard mode clusters. To generate a cost estimate based on your usage, use the pricing calculator.

Quotas

By default, GCP applies a 32 CPU quota per cluster. Autoscaling won’t be able to provision new nodes once the quota is reached. You may need to consider increasing the quota so that you have enough resources to run your pods under peak load.

Node Pools

Platform and tenant pods run on worker nodes which are governed by node pools.

By default, we create a small node pool that runs platform system pods. This pool will not be enough to run tenant workloads, so additional node pools need to be created via auto-provisioning and/or configured explicitly.

For GCP we recommend using Node Auto-Provisioning mode, so that GKE facilitates node pool scaling.

In special cases we allow Platform Operators to configure their node pools explicitly. This can be useful for a hybrid mode when general purpose nodes are provisioned automatically and there are additional node pools with special machine types to support CPU intensive workloads.

For the complete list of available machine types refer to Cloud Provider documentation, e.g. Machine families resource and comparison guide for GCP.

Default node pool

This is a small pool (1 node per zone) with e2-standard-2 machine type. By default, the platform system pods run there.

Assuming we are running a regional cluster with 3 zones, the minimal cost to operate the platform will be 3 nodes of e2-standard-2 machine type.

Scaling with Node Auto-Provisioning

Note

Enabling autoscaling will enforce 1 node to be present in the pool at all times to ensure system pods are running. If you wish to scale down node pools to 0, you have the option of disabling autoscaling. See limitations for more details.

Node Auto-Provisioning (NAP) is used to manage node pools. It allows us to support workloads with various CPU and memory requests by creating node pools with optimal machine types.

In order to enable Node Auto-Provisioning you should specify:

  • number of CPU cores for the whole cluster
  • number of gigabytes of memory for the whole cluster
  • autoscaling profile

Available Autoscaling Profiles:

  • BALANCED: The default profile for Standard clusters that prioritises keeping resources available for incoming pods.
  • OPTIMIZE_UTILIZATION: Prioritise optimizing utilization over keeping spare resources in the cluster. The cluster autoscaler scales down the cluster more aggressively. GKE can remove more nodes, and remove nodes faster.

The following examples enable Node Auto-Provisioning for the cluster via config.yaml:

  • By default, GCE quota is 32, we reserve a 2vCPU VM for the bastion, so that’s 30 cores left
cluster:
  gcp:
    autoscaling:
      cpuCores: 30
      memoryGb: 200
  • When GCE quota is raised to 64, we use a 2vCPU VM for the bastion, so that’s 62 cores left
cluster:
  gcp:
    autoscaling:
      cpuCores: 62
      memoryGb: 400
      profile: "OPTIMIZE_UTILIZATION"

Custom node pools configuration

This is discouraged, but available for users for whom the default pool configuration doesn’t work.

Node pools need:

  • machine type, e.g. e2-standard-4
  • range for the number of nodes in the pool.
  • optional taints and labels

Note

We cannot explicitly define the maximum CPU or memory for a node; that is determined solely by the machine type chosen.

GKE automatically creates new nodes in the pool until it reaches the configured maximum count. When the limit is reached, newly deployed pods will stay in the Pending state waiting for resources to become available.

In specific cases it is possible to combine NAP with explicit node pools configuration. For example, we may need machines with special capabilities. While having common workloads scheduled on auto-provisioned nodes, we may have custom node pools that are managed explicitly.

The following is an example of a hybrid configuration providing a node pool with special capabilities:

cluster:
  gcp:
    autoscaling:
      cpuCores: 50
      memoryGb: 120
    additionalNodePools:
      - name: "g2-pool"
        machineType: "g2-standard-4"
        minCount: 0
        maxCount: 2
        labels:
          gpu: "true"
        taints:
          - key: "gpu"
            value: "true"
            effect: "NO_SCHEDULE"
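Workloads intended for such a pool need to opt in to it. A minimal sketch of a Deployment that targets the g2-pool above via its label and tolerates its taint (the image name and resource figures are illustrative assumptions):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-workload
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-workload
  template:
    metadata:
      labels:
        app: gpu-workload
    spec:
      nodeSelector:
        gpu: "true"            # matches the label defined on the custom node pool
      tolerations:
        - key: "gpu"           # tolerates the taint defined on the custom node pool
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      containers:
        - name: app
          image: example.com/gpu-app:latest
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"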

Cluster Overprovisioning

Problem

When tenants provision more replicas they may experience long delays between deploying a new Pod and actually running it. Provisioning of a new node takes time, it can be up to several minutes. This may not be an issue for a background job, but can definitely affect the performance of an API that needs to scale quickly to be able to handle traffic spikes.

Solution

Overprovisioning

Overprovisioning can be configured using a deployment running pause pods with a very low assigned priority (see Priority Preemption), which reserves resources that can be used by other pods. If there are not enough resources, the pause pods are preempted and the new pods take their place. The pause pods then become unschedulable and force the Cluster Autoscaler (CA) to scale up the cluster.

For more details refer to FAQ: How can I configure overprovisioning with Cluster Autoscaler.
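A minimal sketch of such a setup, assuming a negative-priority PriorityClass and pause pods sized to the headroom you want to reserve (names, replica count and resource figures are illustrative):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10                      # lower than the default of 0, so these pods are preempted first
globalDefault: false
description: "Placeholder priority class for overprovisioning pause pods"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 2                   # how much spare capacity to hold
  selector:
    matchLabels:
      run: overprovisioning
  template:
    metadata:
      labels:
        run: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: reserve-resources
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1"          # the headroom reserved per replica
              memory: "1Gi"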

Platform Cost vs Autoscaling Speed

Cluster Overprovisioning makes the scheduling of new Pods faster, but incurs additional cost.

There is a choice between keeping the operational cost of the platform low and reserving additional resources for faster autoscaling.

DNS

The Core Platform supports as many DNS Zones as you need for regular Ingress.

Configure them in environment config.yaml e.g.

ingress_domains:
  - name: default
    domain: "gcp-dev.cecg.platform.cecg.io"

DNS Zone delegation (per environment)

Before Ingress can function, you must delegate the configured zone to the Google nameservers. These nameservers are not fixed, and can be seen after deployment of platform Ingress with the following command:

kubectl -n platform-ingress get dnsmanagedzones.dns.cnrm.cloud.google.com -o=jsonpath='{.items[0].status.nameServers}'

The output will look something like:

  * ns-cloud-e1.googledomains.com.
  * ns-cloud-e2.googledomains.com.
  * ns-cloud-e3.googledomains.com.
  * ns-cloud-e4.googledomains.com.

Once you have these values, set up NS records in your DNS provider’s configuration, e.g.:

Type: NS 
Name: gcp-pre-dev.cecg.platform 
Value: 
  * ns-cloud-e1.googledomains.com.
  * ns-cloud-e2.googledomains.com.
  * ns-cloud-e3.googledomains.com.
  * ns-cloud-e4.googledomains.com.
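Once the NS records are in place, one way to sanity-check the delegation is to query the zone’s name servers directly (the domain shown is the example used above):

dig NS gcp-pre-dev.cecg.platform.cecg.io +short
# Expect the ns-cloud-e*.googledomains.com. name servers listed above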

Default Zone

When your Core Platform is initially deployed it comes with a zone under CECG’s domain, including your organisation’s <name>. This is just to aid the initial developer experience and isn’t intended for your production services. The domains follow this structure:

<name>.platform.cecg.io

<env>.<name>.platform.cecg.io

<name> is the name of the set of environments. Each set of environments is typically made up of pre-dev, dev, and prod.

<env> is the environment name within a Core Platform.

For example, we (CECG) run our own instance of the Core Platform with a name of cecg, so we end up with:

  • sandbox.cecg.platform.cecg.io
  • pre-dev.cecg.platform.cecg.io
  • dev.cecg.platform.cecg.io
  • prod.cecg.platform.cecg.io

The cecg.platform.cecg.io zone is not currently managed, so the user of the Core Platform needs to create and delegate every environment’s zone.

Managed Databases

Benefits

  • Easy setup
  • Integration with existing Cloud Security methods (IAM Auth etc)
  • Fewer human resources involved in setting them up
  • Downtime SLAs provided
  • Easily configurable HA and scalability
  • Automatic upgrades
  • Out of the box backups
  • Out of the box data encryption, which is a requirement for things like PCI

Disadvantages

  • Vendor lock-in
  • Less configurable compared to self-hosted
  • Higher running cost

Managing Platform Environments

Environments

The platform by default comes with a single organisation made up of three environments:

  • pre-dev: Stage new versions of the platform before releasing to application engineers
  • dev: Used for all development environments, functional testing, integrated testing, performance testing
  • prod: Production

GCP init

The gcp-init tool comes bundled inside the platform docker image. From inside the image:

cd gcp-init
source .venv/bin/activate
python gcp-init.py --help

Creating an organisation

It is expected to have a single organisation, meaning one set of environments.

Before we execute any of the core-platform-cli commands, we need to generate some Application Default Credentials (ADC):

gcloud auth application-default login
mv $HOME/.config/gcloud/application_default_credentials.json $HOME/.config/gcloud/Platform-user.json

This will generate JSON credentials for your user account, and move them to a file called Platform-user.json in the gcloud config directory. We’ll use this file to authenticate as your user when running certain commands.

Before running root-init, we need to check some GCP permissions. To initialize a “parent” (a GCP org or folder), we need the following:

  • roles/viewer on the parent
  • roles/resourcemanager.folderCreator on the parent
  • roles/resourcemanager.projectCreator on the parent

We can now run the root-init command from within the core-platform-cli:

PARENT_FOLDER=??? # e.g. folders/12345
BILLING_ACCOUNT=??? # e.g. 01BC90-344AF2-CED391
ENV_REPO=???  # e.g. {{ github-org }}/platform-environments
PLATFORM_ADMIN_GROUP=???
GOOGLE_APPLICATION_CREDENTIALS=$HOME/.config/gcloud/Platform-user.json \
python gcp-init.py root-init \
    --parent=$PARENT_FOLDER \
    --billing-account=$BILLING_ACCOUNT \
    --github-repos=$ENV_REPO \
    --impersonators=group:$PLATFORM_ADMIN_GROUP \
    --owners=group:$PLATFORM_ADMIN_GROUP

This script is idempotent; if you encounter Cloud API failures, you can generally re-run the command and it will only create the missing resources.

Warning

If the errors indicate that your user doesn’t have permissions to do something, that probably won’t be fixed by re-running.

Creating an environment

To be able to run the next step command (gcp-init.py env-init ...) we need to generate some new application credentials, allowing us to impersonate the env-manager service account from the previous step:

SA_EMAIL=$(jq -r .env_mgr_sa.email output-root_init.json)
gcloud auth application-default login --impersonate-service-account=$SA_EMAIL
mv $HOME/.config/gcloud/application_default_credentials.json $HOME/.config/gcloud/Platform-env-manager.json

This will generate JSON credentials for the service account, and move them to a file called Platform-env-manager.json in the gcloud config directory. We’ll use this file to authenticate as the SA when running certain commands.

We can now extract some information from output-root_init.json to use as input to the env-init command:

ENV=??? # e.g. dev-2
PLATFORM_ADMIN_GROUP=???
ENV_REPO=???  # e.g. {{ github-org }}/platform-environments
BILLING_ACCOUNT=???
SHARED_FOLDER=$(jq -r .shared_folder.name output-root_init.json)
ENVIRONMENTS_FOLDER=$(jq -r .environments_folder.name output-root_init.json)
MANAGEMENT_FOLDER=$(jq -r .management_folder.name output-root_init.json)

GOOGLE_APPLICATION_CREDENTIALS=$HOME/.config/gcloud/Platform-env-manager.json \
python gcp-init.py env-init \
    --shared-folder=$SHARED_FOLDER \
    --environments-folder=$ENVIRONMENTS_FOLDER \
    --management-folder=$MANAGEMENT_FOLDER \
    --github-repos=$ENV_REPO \
    --impersonators=group:$PLATFORM_ADMIN_GROUP \
    --name=$ENV \
    --owners=group:$PLATFORM_ADMIN_GROUP \
    --billing-account=$BILLING_ACCOUNT

This will attempt to initialize an environment with the name specified in $ENV. As before, the script is idempotent and can generally be re-run in the face of errors until it has completed.

Warning

If the errors indicate that your SA doesn’t have permissions to do something, that probably won’t be fixed by re-running.

Manual Steps

Brand

The IAP Brand needs to be manually created before deploying the Core Platform. After running the environment creation, get the project ID under the new environments folder, then run:

PLATFORM_ADMIN_GROUP=???
PROJECT=???
gcloud iap oauth-brands create --application_title="Core Platform" --support_email="$PLATFORM_ADMIN_GROUP" --project $PROJECT

Warning

The person executing this must be an owner of the $PLATFORM_ADMIN_GROUP.

DNS Delegation

See DNS Delegation and Identity Provider Login

Minimise Costs

The platform allows you to optimise your infrastructure for maximum cost efficiency. There may be situations where you are running workloads for which availability is not a priority, e.g. lower environments.

You can run workloads while keeping base cost low by using Spot capabilities, which offer a 60-91% discount compared to standard virtual machines (VMs).

Note

Default node pools hosting system pods run on standard VMs.

Spot Instances

These instances work by using Compute Engine (CE) excess capacity. When CE needs to reclaim some of that capacity, for example during zonal outages, it will shut down those instances to free up capacity.

Limitations

  • Spot instances are only available for supported machine types
  • Spot instances are not always available
  • Not covered by any Service Level Agreement (SLA)
  • You cannot live migrate Spot Instances to Standard Instances
  • Compute Engine service may stop instances abruptly if capacity is required
  • Cloud free tier credits do not count towards Spot Instances

To see more, check out spot limitations

Cost comparison

The following parameters are provided for node pools:

  • Number of instances: 5
  • Machine Type: e2-standard-2
  • Boot Disk Type: standard persistent disk
  • Boot Disk Size: 100GB
  • Region: London

Standard VMs: $339

Spot VMs: $138

Spot prices can change up to once every 30 days, but discounts for relevant resources are always in the 60-91% range.

Best Practices

The following points are recommended when running spot instances:

  • Pick smaller instance types to have a better chance of getting capacity
  • Run spot on weekends or evenings (off peak hours)
  • Design workloads to be fault tolerant e.g. stateless batch jobs

Enable Spot Instances

See spot instances.
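Independently of the platform-level switch, a fault-tolerant workload can also opt in to Spot nodes at the Kubernetes level. A minimal sketch, assuming GKE’s standard cloud.google.com/gke-spot node label and taint on Spot node pools (the image name is a placeholder):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker
spec:
  replicas: 3
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
    spec:
      nodeSelector:
        cloud.google.com/gke-spot: "true"   # schedule only onto Spot nodes
      tolerations:
        - key: "cloud.google.com/gke-spot"  # tolerate the taint applied to Spot nodes
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      containers:
        - name: worker
          image: example.com/batch-worker:latest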

Platform Monitoring

Platform Monitoring

Platform monitoring is responsible for ensuring the quality of the platform by providing visibility of the health of the platform and the workloads running on it. It allows both platform operators and application teams to answer two basic questions: what’s broken and why?

By collecting metrics from key platform components such as the control plane, data plane, bastion and tenant workloads, platform monitoring allows operators and application teams to:

  • analyse long-term trends (e.g. how quickly is my database growing)
  • compare impact of feature releases
  • notify when something is broken or might break soon
  • help answer basic questions about the health of the system using dashboards

Platform Uptime

Uptime is measured by traffic sent continuously to an application deployed as part of the platform.

To view uptime for each environment you can use the Continuous Load Grafana dashboard. For quick access to it, run:

corectl env open <env> grafana/continuous-load

Application resource monitoring

This dashboard allows a tenant to monitor their namespaces. It shows:

  • CPU Usage
  • Memory usage
  • Pod status
  • Pod restart count

Namespace Dashboard

Cluster wide resource monitoring

The Global view dashboard gives a holistic view of the cluster. It shows:

  • Nodes status
  • CPU and Memory usage, requests and limits
  • Pod and namespace count
  • Pods status

Global Dashboard

Cluster connectivity monitoring

The platform-monitoring module also deploys continuous load. This creates k6 injectors and podinfo pods running with a stable throughput, giving us enough data to monitor the different percentiles and any errors that occur, so we can be proactive in investigating and fixing issues.

Continuous load Dashboard

Platform liveness

Shows uptime, probe success rate and probe duration for key endpoints we monitor on the platform. It can also be used to check SSL expiry.

Platform liveness

Platform alerts

Firing and silenced alerts

Firing alerts can be viewed in the Grafana UI either via the Alerts dashboard or via the built-in Alerting section. Alerts can be silenced via the Alerting section by matching the alert(s) label that needs silencing.

Alert silencing

Outbound Connections

Outbound Connections

Cloud NAT (network address translation) lets certain resources in GCP create outbound connections to the internet or to other Virtual Private Cloud (VPC) networks, on-premises networks, or other cloud provider networks. Cloud NAT supports address translation for established inbound response packets only. It does not allow unsolicited inbound connections.

Outbound IP

By default, GCP allocates IP addresses automatically. The IP addresses are managed by GCP, added or removed based on the outbound traffic. This is the default platform configuration. With automatic allocation, you cannot predict the next IP address that is allocated. If you depend on knowing the set of possible NAT IP addresses ahead of time (for example, to create an allowlist), you should use manual NAT IP address assignment instead.

Static (Manual) Outbound IPs Assignment

The platform provides a feature that enables allocation of a number of static IP addresses that can be reserved upfront. When using that feature, you must calculate the number of regional external IP addresses that you need for the NAT gateway. If your gateway runs out of NAT IP addresses, it drops packets. You can increase or decrease the number of allocated static IP addresses by updating the platform environment configuration. Those IP addresses are reserved and will remain so until you delete them. To enable this feature, we use Manual NAT IP address assignment with dynamic port allocation.

Platform environment configuration

network:
  publicNatGateway: # [Optional] configuration for the NAT Gateway
    ipCount: 2 # [Required] number of IP addresses to allocate
    logging: ERRORS_ONLY # [Optional] enable logging, available values: ERRORS_ONLY,TRANSLATIONS_ONLY,ALL, when not provided no logging is enabled, we recommend setting to ERRORS_ONLY
    minPortsPerVm: 64 # [Optional] min number of ports per VM, when not provided default (64) is used
    maxPortsPerVm: 128 # [Optional] max number of ports per VM, when not provided default (32,768) is used
    tcpEstablishedIdleTimeoutSec: 1200 # [Optional] timeout (in seconds) for TCP established connections (default 1200), only update when necessary, otherwise leave default
    tcpTransitoryIdleTimeoutSec: 30 # [Optional] timeout (in seconds) for TCP transitory connections (default 30), only update when necessary, otherwise leave default
    tcpTimeWaitTimeoutSec: 120 # [Optional] timeout (in seconds) for TCP connections that are in TIME_WAIT state (default 120), only update when necessary, otherwise leave default

A minimal version of the same configuration, using placeholders:

network:
  publicNatGateway: # [Optional] configuration for the NAT Gateway
    ipCount: <numbers of IPs to allocate>
    logging: ERRORS_ONLY # enable logging for packet drops due to NAT IP allocation
    minPortsPerVm: <set min number of ports per VM>
    maxPortsPerVm: <set max number of ports per VM>

View assigned IP addresses

To view all assigned IP addresses to the NAT Gateway follow View NAT IP addresses assigned to a gateway
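As a CLI alternative, a minimal sketch of listing the reserved addresses by name (assuming the <env>-nat-ext-ip-<number> naming convention described further down this page):

gcloud compute addresses list \
    --project ${PROJECT_ID} \
    --filter="name~nat-ext-ip"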

Increase allocated number of IP addresses

Warning

Allocating more IP addresses might cause source IP changes to existing services for outbound requests. If third party clients allowlisted specific IPs, they’ll need to update their allowlist accordingly.

  1. Calculate the number of IPs required. To understand your current NAT Gateway usage, see the NAT Gateway dashboard in Grafana. For an example of IP/port calculations, see Port reservation example. To define the minimum number of ports per VM, see Choose a minimum number of ports per VM.
  2. Update environment configuration file as per Platform environment configuration

The Cloud NAT gateway will dynamically allocate a different number of ports per VM, based on the VM’s usage. Min and max ports settings are optional; however, it is strongly recommended to set those values to ensure strong tenant isolation (misbehaving services won’t acquire all available connections). For more information on port allocation see the Ports section.
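As a rough sizing sketch (using 64,512 usable source ports per external IP, as noted in the NatGatewayHighPortUtilisation runbook below, and the example values from the configuration above):

total ports              = ipCount x 64,512            = 2 x 64,512   = 129,024
VMs guaranteed min ports = total ports / minPortsPerVm = 129,024 / 64  ≈ 2,016
VMs at max allocation    = total ports / maxPortsPerVm = 129,024 / 128 ≈ 1,008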

Note

Please note that NAT Gateway has GCP imposed limits, see NAT limits for details.

Decrease allocated number of IP addresses

Warning

Increasing the number of IPs is a safe operation; existing connections won’t be affected. However, decreasing the value without draining the connections first will cause connections to be terminated immediately. See Impact of tuning NAT configurations on existing NAT connections for further details.

All IP addresses are created sequentially, following the naming convention <env>-nat-ext-ip-<number>, numbered from 0 to X. During IP address reservation the platform stores those IPs in an ordered list. We recommend that you remove one IP address at a time. Decreasing the network.publicNatGateway.ipCount number by one causes removal of the last IP address in GCP, so make sure you drain the last IP address. If you remove/drain the wrong address, the release fails, as you cannot delete addresses that are still allocated to the NAT Gateway.

  1. Ensure you choose the last created IP address; this is the address with the greatest number, following the naming convention <env>-nat-ext-ip-<number>

  2. Drain existing connections, see Drain external IP addresses associated with NAT (a minimal gcloud sketch is shown after this list)

  3. Confirm that all connections associated with the drained IP address are closed. You can do so by:

    1. checking the Port Allocation [NAT Gateway] graph in the NAT Gateway Grafana dashboard; there should be no port allocation from the drained IP.
    2. enabling NAT Logging by:
    network:
      publicNatGateway:
        logging: ALL
        ...

    and checking that there are no logs for open connections associated with the drained IP address.

  4. Remove drained IP address assignment from NAT Gateway in UI.

  5. Update network.publicNatGateway configuration and release:

    network:
      publicNatGateway:
        ipCount: 2 # decrease this number to desired number of IP addresses
        ...
  6. Notify any third parties on source IP changes for outbound connections so they can update their allowlists.
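For step 2 above, a minimal sketch of the drain command (resource names are placeholders; flags as per GCP’s Cloud NAT drain documentation):

gcloud compute routers nats update ${NAT_GATEWAY_NAME} \
    --router=${ROUTER_NAME} \
    --region=${GCP_REGION} \
    --project=${PROJECT_ID} \
    --nat-external-drain-ip-pool=${IP_ADDRESS_TO_DRAIN}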

Migrate to static (manual) IP allocation

Warning

Switching IP assignment method is disruptive, and it breaks all active NAT connections. Further info can be found in Switch assignment method

  1. Update chosen environment configuration file as per Platform environment configuration

  2. Validate changes in GCP IP addresses, NAT Gateway

  3. Test outbound connection from the cluster

    kubectl run tmp-shell --rm -it --image nicolaka/netshoot -- /bin/bash # pod network
    kubectl run tmp-shell --rm -it --image nicolaka/netshoot --overrides='{ "spec": { "hostNetwork" : true }  }' -- /bin/bash # host network

    Once in the container, run the following (if it fails, double-check that Google itself is up)

    curl -I www.google.com

Migrate to automatic IP allocation

Warning

Switching IP assignment method is disruptive, and it breaks all active NAT connections. Further info can be found in Switch assignment method

  1. Repeat Decrease allocated number of IP addresses until you’re left with a single IP

  2. Manually update NAT Gateway configuration in GCP UI to set Cloud NAT IP addresses field to Automatic, then save.

  3. Test outbound connection from the cluster

    kubectl run tmp-shell --rm -it --image nicolaka/netshoot -- /bin/bash # pod network
    kubectl run tmp-shell --rm -it --image nicolaka/netshoot --overrides='{ "spec": { "hostNetwork" : true }  }' -- /bin/bash # host network

    Once in the container, run the following (if it fails, double-check that Google itself is up)

    curl -I www.google.com
  4. Remove network.publicNatGateway section from platform environment configuration and release.

Troubleshooting

For troubleshooting the Platform NAT Gateway, follow the NAT Gateway IP Allocation Failures section. For more generic information on common issues with Cloud NAT and how to solve them, see the Troubleshooting guide.

Platform Environments

A Platform environment is the underlying infrastructure for Application environments. It scales up and down automatically based on the applications deployed to it. Dev and prod are isolated to reduce the blast radius of platform deployments.

Each platform environment has a set of predefined workflows that will be executed to test, validate, and publish the application. These workflows ensure the stability and reliability of deployments across different stages.

Choosing a Platform Environment

When setting up a new tenant, you will be prompted to choose a platform environment, e.g.:

  • Pre-Development (pre-dev) should only be used by platform operators for validating new versions of the environment. Most developers can ignore this option.
  • Development (dev) should be used for frequent testing and experimentation.
  • Production (prod) should only be used for live applications with end-user interactions.

Platform Ingress

Platform Ingress

Platform Ingress is responsible for creating the Ingress setup.

Design

Ingress design

What does it include?

  • Public and Private access
  • Cloud DNS managed zones to manage the DNS
  • External DNS
  • Traefik Ingress controller

How does DNS work?

Out of the box we will provide a subdomain of cecg.io. If tenants have their own domain, that can be configured via the config.yaml. The base construction will be: <environment>.<organization>.<ingressDomain>, for example, sandbox-gcp.cecg.cecg.io. The private URL construction will be configurable in the environment config file, but we typically keep the same one with an internal suffix, like sandbox-gcp-internal.cecg.cecg.io

The platform-ingress module installs a Gateway object that creates a GCP load balancer. External DNS picks that up and registers A records pointing at the load balancer IP. All other domains are registered as CNAME records pointing to the A records.

If they use our cecg.io domain, there is an additional step that needs to be done for DNS Delegation. Since we own the domain, any attempt to resolve *.cecg.io will hit our DNS managed zones. We can, however, delegate the resolution of certain subdomains. See DNS.

DNS delegation

How can clients manage records on a subdomain they do not own? We need to create a DNS delegation. For cecg, for example, we’ll need to delegate that subdomain to the cecg client. To do that, all you need to do is create an NS record with the nameservers of the managed zone the cecg client created.

DNS Delegation

How can I differentiate between internal and external

There is a single ingress controller and load balancer. The way to differentiate is to use the ingress you configure in the environment’s config.yaml as internalService, for example sandbox-gcp-internal.cecg.cecg.io. Anything that hits that URL will be forwarded to IAP for authentication and validation. Only people in the platform-readonly@<domain> Google group will be able to access that URL. This means that everything is public, because we do not use VPNs to access the cluster and always go through a public LB, but we can restrict access using IAP. For external DNS to work, each ingress will need to have the annotations:

annotations:
  external-dns.alpha.kubernetes.io/hostname: reference-app.sandbox-gcp.cecg.cecg.io
  external-dns.alpha.kubernetes.io/target: sandbox-gcp.cecg.cecg.io

SSL

SSL works out of the box up to the LB using Let’s Encrypt. It uses a single-level certificate (e.g. *.sandbox-gcp.cecg.cecg.io), which allows users to create single-level subdomains like learn-functional.sandbox-gcp.cecg.cecg.io. Any more levels won’t work, unfortunately.

Example ingress

External Ingress

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: reference-app-external
  annotations:
    external-dns.alpha.kubernetes.io/hostname: reference-app.sandbox-gcp.cecg.cecg.io
    external-dns.alpha.kubernetes.io/target: sandbox-gcp.cecg.cecg.io
  namespace: golang-dev
spec:
  ingressClassName: platform-ingress
  rules:
  - host: reference-app.sandbox-gcp.cecg.cecg.io
    http:
      paths:
      - path: /hello
        pathType: Prefix
        backend:
          service:
            name: reference-service
            port:
              number: 80

Internal Ingress

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  annotations:
    external-dns.alpha.kubernetes.io/hostname: reference-app.sandbox-gcp-internal.cecg.cecg.io
    external-dns.alpha.kubernetes.io/target: sandbox-gcp-internal.cecg.cecg.io
  namespace: platform-monitoring
spec:
  ingressClassName: platform-ingress
  rules:
  - host: reference-app.sandbox-gcp-internal.cecg.cecg.io
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: platform-grafana-service
            port:
              number: 3000

For more information on how to use this, please look at the App Ingress section.

Autoscaling

Platform Ingress scales automatically based on resource consumption to handle spikes of traffic.

Horizontal Pod Autoscaling is enabled for Traefik pods and is configurable via config.yaml.

The following example overrides resource requests for Traefik pods, sets a range for the number of replicas and overrides the CPU usage percentage threshold.

platformIngress:
  resources:
    requests:
      cpu: "500m"
      memory: "300Mi"
  minReplicas: 3
  maxReplicas: 21
  cpuPercent: 80

Future work

We aim to be tech agnostic and remove some redundancies, namely regarding the external-dns annotations. For that we will create a mutating webhook that will inject the needed annotations based on the URL of the ingress. It will also check for conflicts in the configuration and block the creation of any ingress whose host already exists, as that would cause the ingress controller to load balance between two possibly completely distinct applications.

Control Plane and Node Upgrades

Background on GKE versions

In GKE, you cannot specify an exact control plane Kubernetes version. You also cannot downgrade what’s currently deployed. What you can do, though, is set the minimum Kubernetes version you want applied to your cluster.

The versions you get are influenced by Release Channels. Being subscribed to a Release Channel is generally considered a good practice for most, and the Release Channel you are subscribed to dictates the versions that you have available. For further information on this subject, take a look at this in depth documentation.

Without going into too much depth, there are currently 3 Release Channels:

RAPID

This channel delivers the absolute latest features, but with the caveat that these features may not be fully proven in production environments. It’s ideal for staying on the bleeding edge but comes with potential risks.

REGULAR

This is the default option and strikes a balance between providing new features and ensuring stability. Upgrades happen every few weeks, giving you access to new features without being the first adopter.

STABLE

This channel prioritizes stability and only receives the most well-tested updates. Upgrades occur less frequently than the Regular channel, making it suitable for production workloads requiring maximum uptime and reliability.

Not subscribing to a release channel

As mentioned above, if you do not explicitly define your release channel, it will default to REGULAR. What you can do though is explicitly define the release channel to either null or unspecified, at which point you will not be subscribed to any release channel.

GKE will still upgrade your clusters on a scheduled basis, usually moving them on to the next minor version, and apply security patches. There are various pros and cons to not being subscribed, but some key points are:

Benefits

  • More control over the timing of your upgrades
  • Can stay on a specific Kubernetes version for a longer time

Drawbacks

  • Manual management is required: you will have to keep an eye out for newer Kubernetes versions and security patches and apply them yourself
  • Some security risk is also involved, as a cluster that isn’t updated in a timely manner may become vulnerable to security exploits

Our Channel

We are currently subscribed to the REGULAR channel.

Our Kubernetes version

We dynamically source our Kubernetes versions, via a datasource with version prefix filtering.

The below is responsible for fetching versions that match the provided prefix.

data "google_container_engine_versions" "region_versions" {
  provider       = google-beta
  location       = var.gcp_region # Region definition as versions can vary between regions
  version_prefix = "${local.k8s_version}." # This is the version filter, at the time of writing this, it's 1.29.
}

Subsequently, we set the kubernetes_version like so:

kubernetes_version = data.google_container_engine_versions.region_versions.release_channel_latest_version.REGULAR

The kubernetes_version field is then implicitly mapped to the min_master_version field, since as mentioned above you can’t explicitly declare a Kubernetes version, you can only declare the minimum you want installed in a cluster.

Control Plane Upgrades

When a control plane update takes place, during a maintenance window or through a manual update, some downtime could be expected, depending on the Location type of your cluster.

Zonal

Zonal Kubernetes clusters only have one master node backing them, and when an upgrade is taking place, there could be several minutes of master downtime. This means that kubectl stops working, and applications that require the Kubernetes API stop working. You also can’t make any cluster changes while the upgrade is taking place. Deployments, services and various other Kubernetes constructs still work during this time.

Regional

Regional clusters provide multi-zone, highly available Kubernetes masters (3 zones). These masters are behind a load balancer, and upgrades are done in such a way that there is no disruption. The masters are upgraded one at a time, in no specific order, and each master is unavailable only for the duration of its own upgrade.

Our Location Type

We utilize Regional clusters, meaning that downtime should be kept to the minimum while upgrading the control plane.

Node upgrades

When it comes to upgrading your nodes, there’s more than a single strategy. That being said, between the strategies, some steps remain common:

  • The node to be upgraded is cordoned so no new pods can be scheduled on it
  • The node to be upgraded is drained. All strategies respect the pod’s PDBs (Pod Disruption Budgets) and GracefulTerminationPeriod setting (Up to an hour for SURGE, and configurable for BLUE-GREEN)

Depending on the strategy, node upgrades can take a few hours to finish.

Several factors, most notably the upgrade strategy chosen, can affect the overall duration of the upgrade.

The SURGE strategy

This strategy upgrades nodes in a rolling fashion. Nodes are drained of traffic, upgraded, and brought back online while the remaining nodes continue handling traffic. Steps include:

  • Provision a new node
  • Wait for it to be ready
  • Cordon and drain the node to be replaced so no new pods are scheduled on it while existing workloads can finish running
  • Delete the node to be replaced

Keep in mind that resources need to be available for the new surge nodes to come up, otherwise GKE won’t start a node upgrade.

Primary Pros

  • Cost-effective
  • Simpler
  • Faster

Primary Cons

  • Potential Downtime (Apps running on the drained nodes)
  • No easy rollback (Requires manual downgrading of the affected nodes)
  • Main audience should be stateless applications (Where disruptions are more tolerated)

The BLUE-GREEN strategy

This strategy involves GKE creating a new set of node resources (the “green” nodes) with the new node configuration before evicting any workloads on the original resources (the “blue” nodes). It’s important to note that GKE will keep the “blue” nodes until all traffic has been shifted to the “green” nodes.

Primary Pros

  • Rollback mid-upgrade is possible if issues arise
  • A safe space (green) for testing out the release
  • As close to 0 downtime as possible

Primary Cons

  • Significant cost
  • Complexity
  • Need to have much higher quota headroom than SURGE to work properly

Our node upgrade strategy

Both strategies have their use cases. In our case, we use the SURGE strategy, with max_surge set to 1 and max_unavailable set to 0. What this means is that only one surge node is added, and thus only one node is upgraded, at a time. Also, pods can restart immediately on the new surge node.

A SURGE strategy with the max_surge and max_unavailable values we use is typically the slowest of the bunch (still much quicker than blue-green), but the least disruptive. By tweaking those two values you can balance speed against disruption potential.
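For illustration, this maps to the upgrade_settings block of the google_container_node_pool resource in the Google Terraform provider (a sketch only; the surrounding resource definition is omitted):

upgrade_settings {
  max_surge       = 1   # add at most one surge node at a time
  max_unavailable = 0   # never take a node offline before its replacement is ready
}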

Our node versions

We do not explicitly set any version for our nodes, but we have

auto_upgrade  = true

in our node_pool configuration. What this means is that every time the Kubernetes control plane is upgraded, a node upgrade is scheduled automatically for the next maintenance window, to match that version. Naturally, the node and control plane versions won’t be the same at all times, but it’s fine as long as we adhere to the Kubernetes version skew policy.

Identity Provider Login for Internal Services

Internal services such as:

  • Grafana
  • Platform Docs

are exposed with Identity Provider login. Currently only Google GSuite is supported.

Configuring the domain for internal services

In your environment configuration:

internal_services:
  name: secure
  domain: "gcp-dev-internal.cecg.platform.cecg.io"

The domain must be different from your ingress_domains.

DNS delegation for that domain should be configured as in DNS

Alert runbooks

Alerts runbooks

This contains a collection of runbooks that need to be followed for each alert raised on the platform. Each alert should contain a short description and a deep link to the corresponding alert in this document.

KubePodCannotConnectToInternet

  1. Is this affecting pods network and node network too?

    Run a pod on the host network

    kubectl run testbox --rm -it --image ubuntu:18.04 --overrides='{ "spec": { "hostNetwork" : true }  }' -- /bin/bash 

    Then check you can reach the internet.

    apt-get update && apt-get install curl -y
    curl https://www.google.com

    If that fails, check your NAT Gateway. The dashboard can be found at platform-monitoring/NAT Gateway dashboard in Grafana.

  2. Is the Cloud NAT configured correctly?

KubePodCannotConnectToDeveloperPortal

Fires when the blackbox exporter is unable to connect to the developer portal.

  1. Is the developer portal running?

    kubectl -n developer-portal get all
  2. Are other alerts such as KubePodCannotConnectToInternet or ContainerInErrorState firing?

  3. What is preventing the pod from running?

    kubectl -n developer-portal logs deployment.apps/developer-portal
  4. Are there errors relating to the database?

ClusterAutoscalerNoScaleUp

Node auto-provisioning did not provision any node pool for the pending pod because doing so would violate resource limits.

For GCP/GKE, check logs to get more details:

resource.type="k8s_cluster" AND
log_id("container.googleapis.com/cluster-autoscaler-visibility") AND
( "noScaleUp" )

Review and update cluster-wide minimal resource limits set for cluster auto-scaler.
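On this platform those limits are the cluster-wide autoscaling values in config.yaml (the same keys described under Cluster Autoscaling above); the figures below are illustrative:

cluster:
  gcp:
    autoscaling:
      cpuCores: 60    # raise the cluster-wide CPU limit
      memoryGb: 200   # raise the cluster-wide memory limit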

KubeHpaReplicasMismatch

Horizontal Pod Autoscaler has not matched the desired number of replicas for longer than 15 minutes. HPA was unable to schedule the desired number of pods.

Check why HPA was unable to scale:

  • not enough nodes in the cluster
  • hitting resource quotas in the cluster
  • pods evicted due to pod priority

In case of cluster-autoscaler you may need to set up preemptive pod pools to ensure nodes are created on time.

KubeHpaMaxedOut

Horizontal Pod Autoscaler (HPA) has been running at max replicas for longer than 15 minutes. HPA won’t be able to add new pods and thus scale the application.

Notice: for some services maximizing HPA is in fact desired.

Check why HPA was unable to scale:

  • max replicas set too low
  • resource requests (such as CPU) set too low

If using basic metrics like CPU/Memory, ensure you set proper values for requests. For memory-based scaling, ensure there are no memory leaks. If using custom metrics, fine-tune how the app scales according to them.

Use performance tests to see how the app scales.

ContainerInErrorState

Container is not starting up, stuck in waiting state.

  1. Identify which pod is causing the issue in grafana:

    https://<grafana_url>/explore?orgId=1&left=%7B%22datasource%22:%22gmp-datasource%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22editorMode%22:%22code%22,%22expr%22:%22sum%28kube_pod_container_status_waiting_reason%7Breason%20%21%3D%5C%22ContainerCreating%5C%22%7D%29%20by%20%28reason,%20pod%29%20%3E%200%22,%22legendFormat%22:%22__auto%22,%22range%22:true,%22instant%22:true%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D)
  2. What is preventing the pod from starting? Is the container in CrashLoopBackOff? Check pod events:

    kubectl -n <pod_namespace> describe pod <pod_name>

    If in CrashLoopBackOff state, check the process within the container is correctly configured. More info on debugging can be found in GKE docs

  3. Is the image being pulled correctly? Check namespace events:

    kubectl -n <pod_namespace> get events --sort-by=.lastTimestamp

    If in ErrImagePull or ImagePullBackOff check if the container name is configured correctly and the tag exists in the registry. More info on debugging can be found in GKE docs

  4. Can the pod be scheduled? Check request/limits on the container and ensure there is enough in the cluster. More info on debugging can be found in GKE docs

NatGatewayHighPortUtilisation

Meaning

High port utilisation by the NAT Gateway; port allocation has reached 70%. Each external IP address provides 64,512 available ports that are shared by all VMs. Each port corresponds to a connection to a unique destination (IP:PORT:PROTOCOL). When the NAT Gateway runs out of free ports, it will start dropping outbound packets (requests going out to the internet).

Impact

No outbound requests are affected at this point; however, you’re getting closer to the limit. Once utilisation is closer to 100%, some outbound requests will be affected.

Warning

Utilisation doesn’t have to reach 100% for requests to be affected. NAT Gateway will try to allocate at least the number of ports specified in the network.publicNatGateway.minPortsPerVm configuration. If there are not enough ports available to satisfy this value, no ports will be allocated to the VM in need.

Diagnosis & Mitigation

Follow NAT Gateway IP Allocation Failures section.

NatGatewayIpAllocationFailed

Meaning

Failure in allocating NAT IPs to any VM in the NAT gateway. As a result, services residing on the affected VMs will not be able to reach the internet. The NAT Gateway allocates a single IP to multiple VMs. When there are not enough available NAT source IP address and source port tuples (IP:PORT:PROTOCOL), the NAT Gateway won’t be able to service any new outbound connections.

Impact

Some current outbound requests are affected.

Diagnosis & Mitigation

Follow NAT Gateway IP Allocation Failures section.

ArgoCDApplicationIsNotHealthy

Meaning

One of the application components is not healthy, resulting in the whole application being marked as unhealthy.

Impact

Tenant resources are allocated as the ArgoCD applications, meaning that part of the tenant functionality might not be working as expected. For example, some of the tenant permissions might be missing.

Diagnosis & Mitigation
  1. Identify the problematic component by checking the ArgoCD UI: https://argocd.{{ internalServices.domain }}
  2. Act accordingly to the error:
    • If the error is related to environment configuration, fix the configuration.
    • If the errors can be fixed by re-applying the manifests, you can do so by running Sync operation for the resource with Replace and Force options. Here is the How-to on how to do it.
    • If it doesn’t help, contact the Core Platform team.

ArgoCDApplicationIsNotSynced

Meaning

Application components state is different from the desired state.

Impact

Tenant resources are allocated as the ArgoCD applications, meaning that part of the tenant functionality might not be working as expected. For example, some of the tenant permissions might be missing.

Diagnosis & Mitigation
  1. Identify the problematic component by checking the ArgoCD UI: https://argocd.{{ internalServices.domain }}
  2. If the resource can’t be updated because it’s immutable, it’s most probably a bug and should be fixed by Core Platform team. But you can try to do a quick fix by syncing it with Replace and Force options via ArgoCD UI. Here is the How-to on how to do it.

NoIapAuthOnInternalServiceDomain

Meaning

Some or all services on the platform’s designated internal services domain can be accessed without authentication. For more information on the platform’s authenticated endpoints see the Platform ingress and Internal services pages. This alert is triggered by accessing the podinfo svc on https://ready.{{ internalServices.domain }} as an unauthenticated user and receiving a 200 response code (a 302 redirect to Google auth is expected).

Impact

Potentially, we’re exposing internal endpoints on a public network that can be accessed by anyone. This is a major incident and should be dealt with immediately.

Diagnosis & Mitigation
  1. Validate that you can access internal endpoint without authentication:

    curl -v https://ready.{{ internalServices.domain }}

    You’ll get status code 200 if you can; otherwise, a 302 redirect to the Google auth endpoint.

  2. Go to IAP configuration in GCP console and select a project.

  3. You’ll see two backend services (one for each k8s traefik svc), ensure the backend service with name containing platform-ingress-traefik-iap has IAP enabled. The other should have IAP disabled.

  4. Ensure there are no errors in the Status column for both backend services. In case of errors, re-enable IAP on the affected backend service and re-test.

  5. Validate environment Gateway configuration

    kubectl get httproute -n platform-ingress

    Check that HOSTNAMES are correctly configured, where {{ ingressDomains.domain }} should point to IAP disabled backend service (traefik), and {{ internalServices.domain }} should point to IAP enabled one (traefik-iap)

    kubectl get gateway -n platform-ingress -o yaml

    Check that allowedRoutes are correctly configured according to httpRoutes

Platform Implementation

Platform Implementation describes how things work rather than how to use them. This section of the docs is for Platform Engineers wanting to contribute or understand how things work under the covers.

Subsections of Platform Implementation

Alerting

Alerting

Alert manager

The managed alertmanager is a single-replica statefulset deployed with Google Managed Prometheus. It receives alerts from the rule evaluator and sends notifications to configured receivers.

kubectl -n gmp-system get sts alertmanager 
kubectl -n gmp-system get deploy rule-evaluator

Alerts definitions

Alerts are defined using Rules, ClusterRules or GlobalRules.

Rules spec follows the same format as Prometheus rules files, which makes it possible to test them using promtool. To view alert rules, run:

kubectl -n platform-monitoring describe rules
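For illustration, a minimal sketch of a Rules resource in the Google Managed Prometheus format (the alert name, expression and threshold are illustrative assumptions, not a shipped platform alert):

apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  name: example-alert-rules
  namespace: platform-monitoring
spec:
  groups:
    - name: example
      interval: 30s
      rules:
        - alert: PodRestartingTooOften
          expr: increase(kube_pod_container_status_restarts_total[10m]) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            description: "Container restarted more than 3 times in 10 minutes"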

Grafana

Grafana

Grafana is installed using the grafana operator which manages the grafana instance, dashboards and datasources using CRDs. CRDs API reference: https://grafana-operator.github.io/grafana-operator/docs/api/

It runs as a deployment:

kubectl -n platform-monitoring get deploy grafana-operator
kubectl -n platform-monitoring get deploy platform-grafana-deployment

Dashboards

Dashboards are automatically synced by the operator. You can use the grafanadashboard resources to check their status and when they were last synced.

kubectl -n platform-monitoring get grafanadashboard
NAME                    NO MATCHING INSTANCES   LAST RESYNC   AGE
bastion                                         42m           7h43m
continuous-load                                 2m17s         7h43m
kubernetes-apiserver                            42m           7h43m
[...]

When exporting dashboard json from Grafana, make sure special characters are replaced as follows

  • replace {{ target }} with {{ "{{" }} target {{ "}}" }}
  • replace $somevar with ${somevar}

Datasources

  • Prometheus: points to the prometheus frontend to access all dashboard metrics
  • Alertmanager: points to the managed alertmanager to manage silences, view firing alerts, contact points, and notification policies

Infra Connector

Infra Connector

Infra Connector is a module in the reference core platform that allows cloud objects to be created using Kubernetes resources.

The current implementation for GCP uses the k8s Config Connector. Installing it makes available a variety of CRDs that allow creating different GCP resources without the need to write Terraform code. For example:

apiVersion: iam.cnrm.cloud.google.com/v1beta1
kind: IAMServiceAccount
metadata:
  name: {{include "account.fullname" .Values.tenant.name }}
  annotations:
    propagate.hnc.x-k8s.io/none: "true"
spec:
  displayName: GCP Service Account for tenant {{ .Values.tenant.name }}

This will create a GCP SA for each tenant being provisioned.

Current usage

The goal of this module is to decouple Terraform from the platform modules. Having this allows us to create cloud resources with something like Helm and doesn’t tie us down to Terraform, meaning we can couple or decouple other modules much more easily. This is one of the two modules in the current implementation that use Terraform; everything else is installed with the help of a script. If other modules require cloud resources, they create them using the Infra Connector CRDs.

Future usage

Another advantage of using this is that we can allow tenants to create GCP resources like buckets, databases, etc. that they might need without reaching out to the platform or to a DevOps team, making them more independent. What they can and can’t create will be controlled with a mix of RBAC and the policy controller: a Role that specifies which objects they can create, and the policy controller to ensure what they create is allowed and won’t impact any other tenant.

Metrics collection

Metrics collection

The platform uses Google Managed Prometheus which comes with a scalable backend prometheus storage and metrics collectors that scrape exposed metrics endpoints such as kubelet/cadvisor and kube state metrics via CRDs. CRDs are defined here: https://github.com/GoogleCloudPlatform/prometheus-engine/blob/v0.7.4/doc/api.md

The GMP operator runs as a deployment

kubectl -n gmp-system get deploy gmp-operator

Kube state metrics - docs

Generates metrics from a wide range of Kubernetes objects. These can be used to assess the health of your pods, deployments, jobs and many other Kubernetes objects.

They generally start with kube_.

It runs as a deployment:

kubectl -n gmp-public get deploy kube-state-metrics

Note that GMP re-labels namespace to exported_namespace as it reserves namespace for the namespace of the pod that the metric is scraped from. When importing dashboards that rely on kube-state-metrics metrics, the queries must use exported_namespace.
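For example, a dashboard query over kube-state-metrics data would filter on exported_namespace rather than namespace (golang-dev is used here as an illustrative tenant namespace):

# Counts running pods in the tenant namespace (correct label under GMP)
sum(kube_pod_status_phase{exported_namespace="golang-dev", phase="Running"})

# Here namespace refers to the namespace the metric was scraped from, not the tenant's
sum(kube_pod_status_phase{namespace="golang-dev", phase="Running"})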

cadvisor - docs

Collects metrics for containers running on the node; it runs alongside kubelet on each node. Typical metrics include CPU, memory and I/O usage, which can be used to diagnose performance issues.

They generally start with container_

kubelet - docs

kubelet is the agent running on the node that is responsible to ensure containers are running and healthy. Collected metrics can be used to identify pod start duration, the number of pods and containers on the node and other information about the node, such as status

Blackbox exporter - docs

This is used to probe key endpoints on or outside the platform, so we can monitor uptime and SSL expiry of components with TLS termination. It runs as a deployment:

kubectl -n platform-monitoring get deploy prometheus-blackbox-exporter

Node exporter - docs

Collects hardware and OS-level system metrics exposed on the node. Metrics include host memory, cpu, disk and network. It runs as a daemonset:

kubectl -n gmp-public get ds node-exporter

Tenancy

Tenancy

Tenants are organised via the Hierarchical Namespace Controller

  • cecg-system: Internal components
  • reference-applications: Applications to show you how to use the platform

The application teams can create tenancies under root or another top level folder e.g. tenants

❯ kubectl hns tree root
root
├── cecg-system
│   ├── platform-ingress
│   ├── platform-monitoring
│   └── platform-policy
├── reference-applications
│   └── knowledge-platform
│   └── golang
│       ├── [s] golang-functional
│       └── [s] golang-nft
│       └── [s] golang-integration
└── tenants
    ├── cecg-playground
    └── devops-playground

[s] indicates subnamespaces
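For example, a tenancy can be created as a subnamespace with the hns kubectl plugin (my-tenant is a placeholder name):

# Create a new subnamespace under the tenants folder
kubectl hns create my-tenant -n tenants

# Verify it appears in the hierarchy
kubectl hns tree tenants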

Platform Troubleshooting

Autoscaling failures

0/7 nodes are available: 7 Insufficient memory

0/7 nodes are available: 7 Insufficient memory. preemption: 0/7 nodes are available: 1 Insufficient memory, 6 No preemption victims found for incoming pod.

Pods are stuck in Pending state.

Total memory requests for pods have exceeded the maximum memory that is allowed as part of node autoscaling in the cluster.
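To confirm this is the cause (namespace and pod names below are placeholders), inspect the scheduling events on a pending pod; they will contain the message above:

kubectl get pods --field-selector=status.phase=Pending -n my-namespace
kubectl describe pod my-pending-pod -n my-namespace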

Resolution

Update config.yaml to increase the memory limit of the cluster autoscaling. Example:

cluster:
  gcp:
    autoscaling:
      cpuCores: 30
      memoryGb: 140

After the limits have been applied to the cluster, the pod should transition from Pending to Running state.

0/7 nodes are available: 7 Insufficient cpu

0/7 nodes are available: 7 Insufficient cpu. preemption: 0/7 nodes are available: 1 Insufficient cpu, 6 No preemption victims found for incoming pod.

Pods are stuck in Pending state.

Total cpu requests for pods have exceeded the maximum cpu that is allowed for node autoscaling.

Resolution

Update config.yaml to increase the cpu limit of the cluster autoscaling. Example:

cluster:
  gcp:
    autoscaling:
      cpuCores: 60
      memoryGb: 140

Node Imbalance

There are times when a single node is overloaded, e.g. at 96% memory usage, while other nodes have more than enough capacity to accommodate extra workloads.

It is highly likely that pods running on that node do not have memory/cpu requests set. This causes the kube scheduler to place significant load on one node, as it relies on requests to decide which nodes pods should be placed on.

Resolution

Set resource requests for your application workloads so the kube scheduler can better place your pods on nodes with appropriate capacity. For example, if you request 2Gi of memory for your pod, the scheduler will only place it on a node with at least that much unreserved capacity.
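For example (the names and values below are illustrative, not recommendations), requests are set per container in the workload spec:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app            # illustrative name
spec:
  template:
    spec:
      containers:
      - name: my-app
        image: my-app:latest
        resources:
          requests:
            cpu: "500m"   # adjust to your workload's measured usage
            memory: 2Gi
          limits:
            memory: 2Gi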

Deployment Failures

Local port [xxxx] is not available

A local port is used on the GitHub agent for the IAP tunnel that proxies to the Kubernetes API server. Sometimes the randomly selected port is not available.

Logs

2023-11-17T16:31:21.6142491Z --- Start tunnel (IAP)
2023-11-17T16:31:28.1159411Z ERROR: (gcloud.compute.start-iap-tunnel) Local port [57834] is not available.

Actions

The job can be re-run using re-run failed jobs

Ingress / TLS Failures

A new ingress domain is not working

When adding a new ingress domain the platform:

  • Creates a Cloud DNS Managed Zone. You need to set up delegation for this domain so that Cloud DNS becomes the authoritative name server.

IPs not being allowlisted by traefik

You have configured IP allowlisting using Traefik middlewares but are still getting a Forbidden response when accessing endpoints from a valid IP address.
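As a sketch of the kind of middleware involved (the name, namespace and CIDR are placeholders, and the apiVersion may differ depending on your Traefik version), an IP allowlist middleware looks roughly like:

apiVersion: traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: office-ip-allowlist        # placeholder name
  namespace: my-tenant-namespace   # placeholder namespace
spec:
  ipWhiteList:
    sourceRange:
      - 203.0.113.0/24             # placeholder CIDR; use your real source ranges

Verify that the source ranges in your middleware actually cover the client IP that Traefik sees, then continue with the steps below.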

Enable JSON logs

Edit traefik deployment to add the arguments:

kind: Deployment
metadata:
  name: traefik
  namespace: platform-ingress
# ... other fields omitted
spec:
  template:
    spec:
      containers:
      - args:
        - --log.level=DEBUG
        - --log.format=json
        - --accesslog=true
        - --accesslog.format=json

View traefik logs

kubectl logs -f deployment/traefik -n platform-ingress

Logs:

  { "level": "debug","middlewareName": "platform-ingress-ready-ipwhitelist@kubernetescrd","middlewareType": "IPWhiteLister","msg": "Accepting IP 86.160.248.78","time": "2024-08-07T22:03:23Z" }
Check Load Balancer logs

On the Google Cloud Console, navigate to the Logging Explorer and run the following query:

resource.type="http_load_balancer" resource.labels.project_id="<your-gcp-project-id>"

Load Balancer Logs

Actions

1. Set up DNS Zone delegation for the new domain

See dns delegation setup
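As a sketch (the zone and domain names are placeholders), you can look up the name servers Cloud DNS assigned to the new zone and create matching NS records in the parent zone:

# Find the name servers assigned to the new managed zone (placeholder zone name)
gcloud dns managed-zones describe my-new-zone --format="value(nameServers)"

# Then, in whatever system hosts the parent domain, create NS records for the
# new subdomain pointing at the name servers returned above.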

2. Restart cert manager

Cert manager at times does not find the new Cloud DNS Zone. If this is the case you’ll see cert manager logs like:

E1124 04:28:57.742110       1 controller.go:167] "cert-manager/challenges: re-queuing item due to error processing" err="No matching GoogleCloud domain found for domain XXXX.XX." key="platform-ingress/ingress-tls-secure-1-2070261079-2597927957"

Restarting cert manager:

kubectl delete pods -n platform-ingress -l app=cert-manager

On restarting the error should go away. If not, raise a support ticket with the logs for:

  • External-DNS
  • Cert Manager
  • Output from:
kubectl get pods -n platform-ingress
kubectl get certificates -n platform-ingress
kubectl get certificaterequests -n platform-ingress
kubectl describe gateway traefik -n platform-ingress

NAT Gateway IP Allocation Failures

  1. Are there not enough IPs allocated? Validate the root cause:

    1. A single service is keeping a lot of connections open
      1. Go to the Grafana platform-monitoring/NAT Gateway dashboard and check:

        1. which VMs have an unusually high number of open connections,
        2. which VMs have an unusually high number of allocated ports,
        3. correlate this information with the Network Received by Namespace graph in the platform-monitoring/Kubernetes/Views/Global dashboard on Grafana to find the namespace that uses the most network bandwidth, then check which pods in that namespace are responsible using the Network Bandwidth graph in platform-monitoring/Kubernetes/Views/Pods.
      2. [Optional] Enable NAT Gateway logging if not already enabled. Logging provides more detailed information on existing connections. To enable logs, update the network.publicNatGateway.logging value to one of ERRORS_ONLY, TRANSLATIONS_ONLY, or ALL:

        network:
          publicNatGateway:
            logging: TRANSLATIONS_ONLY # update to the desired logging level

        See Configure logging for log level explanation.

    2. Cluster Autoscaler is creating an excessive number of VMs. Validate this in the GCP dashboard under node pool metrics.
    3. The cluster grew naturally and more source NAT IPs are required. If the above are not causing the issue, validate that there is a valid reason for more IPs to be reserved and attached to the NAT Gateway. Look at long-term trends in service, node, and port allocation growth.
    Warning

    Allocating more IP addresses might cause source IP changes to existing services for outbound requests. If third party clients allowlisted specific IPs, they’ll need to update their allowlist accordingly.

    Increase the number of IPs allocated to the NAT Gateway by updating the number of IPs in your environments/<env_name>/config.yaml file:

    network:
      publicNatGateway:
        ipCount: <number of IPs allocated> # increase this value to the desired number of IPs

    Release the change and validate that port utilisation went down below 70%.

  2. Are you allocating too many minimum ports per VM?

    1. Go to the Grafana platform-monitoring/NAT Gateway dashboard and compare allocated ports per VM against used ports per VM. Make sure to extend the time span to take all traffic spikes into account. If ports are being allocated but not used most of the time, you can decrease the minPortsPerVm setting in the environments/<env_name>/config.yaml file:

      network:
        publicNatGateway:
          minPortsPerVm: <min number of ports allocated to a single VM> # decrease this value to release ports

      See Choose a minimum number of ports per VM for further details.

    2. If all ports are utilised, check if you can Reduce your port usage; otherwise, increase the ipCount value (see section 1).

Warning

Increasing the number of IPs is a safe operation; existing connections won’t be affected. However, decreasing the value without draining the connections first will cause connections to be terminated immediately. See Impact of tuning NAT configurations on existing NAT connections for further details.

NAT Gateway provisioning/updates failures

Error: Error when reading or editing Address: googleapi: Error 400: External address used for NAT cannot be deleted., badRequest

When updating your network.publicNatGateway.ipCount configuration, the update will fail if you try to remove an already allocated IP address. To decrease the number of allocated IPs, drain the address first, then remove it from the NAT Gateway manually before running the IaC tool. Follow Decrease allocated number of IP addresses.
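A rough sketch of the manual drain step with gcloud (the NAT, router, region and address names are placeholders; verify the flags against your gcloud version):

# Move the address to the drain pool so existing connections can complete (placeholder names)
gcloud compute routers nats update my-nat \
    --router=my-router \
    --region=europe-west2 \
    --nat-external-ip-pool=ip-to-keep-1,ip-to-keep-2 \
    --nat-external-drain-ip-pool=ip-to-remove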

NAT Gateway high error count in logs

The NAT Gateway only logs errors related to packet drops caused by no port being available for NAT. To investigate and resolve, follow the steps below:

  1. Increase the log level to ALL (if not already on this level) to get more details about successful connections (and to see whether any of the errored connections eventually succeed)
  2. If there is a high error count on sent packets, check the trend of packet drops per reason in the platform-monitoring/NAT Gateway Grafana dashboard; for the OUT_OF_RESOURCES reason, follow Packets dropped with reason: out of resources
    Note

    We use manual NAT IP address assignment with dynamic port allocation

NAT Gateway high number of received packet drops

Note

A Cloud NAT gateway drops an ingress data packet if the connection tracking table doesn’t contain an entry for the connection. Drops can be due to timeouts or an external endpoint trying to establish a connection. A higher number than usual does not necessarily indicate degradation of any service.

  1. Check the long-window trend of the received packet drop rate in the platform-monitoring/NAT Gateway Grafana dashboard. Establish which VM/pod is experiencing the most drops, then validate whether it is a genuine failure. For genuine failures, see Dropped received packets for hints on how to resolve.

Subsections of How-Tos

Autoscale cluster

Warning

Cluster autoscaling will not work unless resource requests for your workloads are defined. See setting resource requests.

Note

Nodes typically take 80 to 120 seconds to boot. If this is too long for you, see overprovisioning.

Enable Autoscaling

Autoscaling can be enabled using the config.yaml file. See cluster autoscaling for more details of how autoscaling works.

cluster:
  gcp:
    autoscaling:
      cpuCores: 60
      memoryGb: 200
      profile: "OPTIMIZE_UTILIZATION"

Disable Autoscaling

Simply remove the autoscaling block from config.yaml.

Before

cluster:
  gcp:
    autoscaling:
      cpuCores: 20
      memoryGb: 80
      profile: "OPTIMIZE_UTILIZATION"
    additionalNodePools:
      - name: "4-pool"
        diskType: "pd-standard"
        machineType: "e2-standard-4"
        minCount: 0
        maxCount: 5

After

cluster:
  gcp:
    additionalNodePools:
      - name: "4-pool"
        diskType: "pd-standard"
        machineType: "e2-standard-4"
        minCount: 0
        maxCount: 5

Custom Node Pools

Define a node pool with machine type g2-standard-4 and disk type pd-extreme that autoscales from 0 to 5 nodes.

cluster:
  gcp:
    additionalNodePools:
      - name: "gpu-pool"
        machineType: "g2-standard-4"
        diskType: pd-extreme
        minCount: 0
        maxCount: 5
        labels:
          e2: "true"

Enable Spot Instances

Warning

Spot instances are not recommended for running workloads that are not fault tolerant.

Note

Spot works well if you need to cut down costs for your clusters. See minimising costs.

Cluster setup with minimal costs

Create a node pool with Spot instances, standard (HDD) disks and the e2-medium machine type.

cluster:
  gcp:
    additionalNodePools:
      - name: "spot-pool"
        machineType: "e2-medium"
        diskType: "pd-standard"
        minCount: 0
        maxCount: 5
        spot: true

# Nodepool with 5 VMs using standard will cost $162 per month

# Nodepool with 5 VMs using spot nodes will cost $61 per month

Exclude Logs

Log exclusions

To change log retention or set up exclusion filters, configure them in your config.yaml:

platformLogging:
  logRetentionDays: 10
  exclusionFilters:
  - name: my-exclusion
    description: Excludes info messages from my-container
    filter: resource.labels.container_name = my-container AND severity = Info

Exclusion filters use the Logging query language

Note

Log entries are excluded after they are received by the Logging API. The number of API calls is not reduced by excluding log entries.

Overprovision Pods

Note

Overprovisioning works well if workloads need to be provisioned immediately. See why you need overprovisioning.

Overprovision Pods

Create 5 pause pods, each requesting 1 CPU and 200Mi of memory. In total this reserves 5 CPUs and 1000Mi (~1Gi) of memory.

cluster:
  overprovisioning:
    replicas: 5
    cpu: "1"
    memory: "200Mi"

Setup Alerts

Note

Alerts on key metrics can be useful to reduce your time to detection of issues. See Platform Monitoring for what type of dashboards and metrics are available on the platform.

Alert notifications

To send alerts to a dedicated slack channel, configure a slack webhook in your config.yaml:

platform_monitoring:
  slack_alert_webhook: https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX

Access the Traefik Dashboard

You can use the Traefik Dashboard to review e.g. the configuration of HTTP middleware in the platform.

It is located at https://traefik-dashboard.{{ internalServices.domain }}/dashboard/

Sync ArgoCD App

You can use the ArgoCD UI to trigger a sync for an ArgoCD Application.

ArgoCD UI is located at https://argocd.{{ internalServices.domain }}

Since tenant resources are provisioned as part of an ArgoCD Application, this might help you fix issues related to tenancy.

Here are the steps:

  • Log in as the admin user
    • To get the password run: kubectl -n argocd get secret argocd-initial-admin-secret -o json | jq .data.password -r | base64 -d (on macOS, use base64 -D)
  • Pick the ArgoCD Application you want to sync.
  • You can either sync the whole application or select a specific resource.
  • Select options for the sync and click the Synchronize button. For example, Force and Replace will cause ArgoCD to recreate resources.
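Alternatively (the hostname and application name below are placeholders), the same sync can usually be triggered from the argocd CLI:

# Log in and sync a single application from the CLI (placeholder values)
argocd login argocd.<internal-services-domain> --username admin --password '<password-from-secret-above>'
argocd app sync my-tenant-app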