AKS Auto Upgrades
November 2022
Automatic Cluster Upgrades Failing
Automatic patch upgrades for AKS clusters
Auto-upgrade is an AKS feature that lets you tell Azure to automatically upgrade your clusters, depending on which channel you opt for. Currently there are five channels to choose from: `none`, `patch`, `stable`, `rapid` and `node-image`. Each channel and its effects are documented. We have chosen to enable auto-upgrade for `patch` versions.
The AKS upgrade process
During the upgrade process, AKS will:
- Add a new buffer node (or as many nodes as configured in max surge) to the cluster that runs the specified Kubernetes version.
- Cordon and drain one of the old nodes to minimize disruption to running applications. If you’re using max surge, it will cordon and drain as many nodes at the same time as the number of buffer nodes specified.
- When the old node is fully drained, it will be re-imaged to receive the new version, and it will become the buffer node for the following node to be upgraded.
- This process repeats until all nodes in the cluster have been upgraded.
- At the end of the process, the last buffer node will be deleted, maintaining the existing agent node count and zone balance.
Pod Disruption Budgets (PDBs)
At a high level, PDBs are easy to understand. You set them to tell Kubernetes that you want a minimum number of Pods available or a maximum number of Pods unavailable.
If we wanted a PDB that always made sure we had 2 Pods available for a particular deployment with a label matching `app=zookeeper`, then we could create that with the config below.
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: zookeeper
```
Say we had 3 Pods in our zookeeper deployment; K8s would know that it’s only allowed one voluntary disruption. If, for example, you were draining a node and two of the Pods belonging to our zookeeper deployment were running on that node, K8s would know to only delete and reschedule one Pod at a time.
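To see how K8s tracks this, here is roughly what the status section of that PDB would look like while all 3 zookeeper Pods are healthy. The values are illustrative, not real output; the live status can be viewed with `kubectl get pdb zk-pdb -o yaml`.

```yaml
# Illustrative status for the zk-pdb example above while all 3 zookeeper Pods are healthy.
status:
  currentHealthy: 3        # healthy Pods matching the selector
  desiredHealthy: 2        # derived from minAvailable in the spec
  disruptionsAllowed: 1    # 3 healthy - 2 required = 1 voluntary disruption permitted
  expectedPods: 3
```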
The problem
When an AKS node is drained, there are a few conditions that have to be met for the drain to succeed. One of those conditions is that none of the pods being drained from the node are associated with a Pod Disruption Budget (PDB) that has an `ALLOWED DISRUPTIONS` value of `0`. If there is a pod associated with a PDB with no allowed disruptions, the draining of the node will be blocked, and if nothing is done to change that, the drain action will time out and the upgrade will fail. It only takes a single PDB with `ALLOWED DISRUPTIONS` at `0` to stop a node from being drained. At the time of writing we had `783` pods deployed and `170` PDBs configured on the CFT Demo cluster, and `20` of those PDBs had an `ALLOWED DISRUPTIONS` value of `0`. So you can see how easily not having a “clean” cluster will stop auto-upgrades from happening.
Whilst working on a ticket, I realised that PDBs were blocking the auto-upgrades because of failed or pending deployments. Some of the environments weren’t too bad, but once I hit environments like Dev, ITHC and Demo, I had to do quite a lot of manual work to get the cluster into an upgradeable state.
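To make that concrete, this is an illustrative status for a PDB with `minAvailable: 2` whose deployment has a failed or pending Pod. The numbers are made up, not taken from one of our clusters.

```yaml
# Illustrative only: one of the three Pods is failed/pending, so the budget is exhausted.
status:
  currentHealthy: 1        # only one Pod matching the selector is healthy
  desiredHealthy: 2        # minAvailable from the spec
  disruptionsAllowed: 0    # no voluntary disruptions allowed, so a drain touching these Pods is blocked
  expectedPods: 3
```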
I’m certain that this issue with the PDBs will also have an effect when running cluster upgrades manually for the Major and Minor versions, although I haven’t tested that theory.
Solution 1
I believe we should remove PDBs in some environments and build an operator/tool that constantly checks the health of Deployments being targeted by PDBs. I’ve been unable to find an existing tool that deals with this scenario, and I think it may be difficult to find one, because the tool would also have to account for the fact that we use Flux (v1 & v2), which will try to overwrite any changes that are not made in code in our Flux repos. At a high level, the tool would:
- Identify any PDBs with an `ALLOWED DISRUPTIONS` value of `0`.
- Check the deployment to see if there are any pods in a `Ready` state.
- Suspend Flux syncing on the kustomization or HR from which the resources are being deployed.
- Nullify the PDB’s control by one of the following (not an exhaustive list):
  - Scale down the bad deployment to 0
  - Remove the PDB (sledgehammer option)
  - Update the PDB to 100% `maxUnavailable` or 0 `minAvailable` (probably the best option; see the example manifests after this list)
- Annotate the HRs, PDBs and/or Deployments so teams can identify why syncing is suspended and why resources have changed.
- Notify the correct team that their Flux deployment has been suspended and why.
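As a rough sketch of what the suspend, nullify and annotate steps could look like as manifests: the resource names, namespaces and annotation keys below are made up for illustration, and the Flux API version may differ depending on which Flux v2 release is installed.

```yaml
# Sketch only: resource names, namespaces and annotation keys are hypothetical.
# 1. Suspend Flux syncing so it does not revert the manual changes below.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: example-app                # hypothetical Kustomization owning the unhealthy deployment
  namespace: flux-system
  annotations:
    example.com/upgrade-reason: "Suspended while PDB blocks AKS auto-upgrade node drain"   # made-up annotation key
spec:
  suspend: true                    # stop Flux reconciling this Kustomization
  interval: 10m
  path: ./apps/example-app
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-config              # hypothetical source name
---
# 2. Neutralise the PDB by allowing 100% of its Pods to be unavailable.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-app-pdb            # hypothetical PDB name
  namespace: example-app
  annotations:
    example.com/upgrade-reason: "maxUnavailable set to 100% to unblock cluster upgrade"    # made-up annotation key
spec:
  maxUnavailable: "100%"           # replaces the original minAvailable/maxUnavailable value
  selector:
    matchLabels:
      app: example-app
```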
The split below is an example; the environments might be slightly different after a discussion.

| Env | PDBs Enabled | Operator/Tool |
|---------------|--------------|---------------|
| DEV/PREVIEW | NO | NO |
| TEST/PERFTEST | NO | YES |
| ITHC | YES | YES |
| DEMO | YES | YES |
| STAGING/AAT | YES | NO |
| PROD | YES | NO |
Solution 2
Don’t use PDBs in most environments and deal with any issues manually. We should probably have something in place that alerts when a PDB has an `ALLOWED DISRUPTIONS` value of `0`, but doesn’t make any changes to resources.
| Env | PDBs Enabled | Operator/Tool | Monitoring Tool |
|---------------|--------------|---------------|-----------------|
| DEV/PREVIEW | NO | NO | NO |
| TEST/PERFTEST | NO | NO | NO |
| ITHC | NO | NO | NO |
| DEMO | NO | NO | NO |
| STAGING/AAT | YES | NO | YES |
| PROD | YES | NO | YES |
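For the monitoring tool, one option is a Prometheus alert. This is a rough sketch that assumes kube-state-metrics is deployed and exposes the `kube_poddisruptionbudget_status_pod_disruptions_allowed` metric, and that the prometheus-operator `PrometheusRule` CRD is available; the rule name, duration and labels are illustrative choices.

```yaml
# Sketch only: assumes kube-state-metrics and the prometheus-operator CRDs are installed.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pdb-zero-disruptions       # hypothetical rule name
  namespace: monitoring
spec:
  groups:
    - name: pdb.rules
      rules:
        - alert: PDBZeroDisruptionsAllowed
          # Fires when a PDB has allowed no voluntary disruptions for 30 minutes,
          # which would block node drains and therefore AKS auto-upgrades.
          expr: kube_poddisruptionbudget_status_pod_disruptions_allowed == 0
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "PDB {{ $labels.namespace }}/{{ $labels.poddisruptionbudget }} allows 0 disruptions"
            description: "This PDB will block node drains and cluster upgrades until its workload is healthy again."
```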
Solution 3
Turn off auto-upgrades for patch versions. I think we will still have this issue when manually upgrading the cluster, but that needs to be proved. We can work around this issue by destroying and recreating clusters.