AKS Auto Upgrades
November 2022
Automatic Cluster Upgrades Failing
Automatic patch upgrades for AKS clusters
Auto-upgrade is an AKS feature that lets you tell Azure to automatically upgrade your clusters, depending on which channel you opt for. Currently there are five channels to choose from: `none`, `patch`, `stable`, `rapid` and `node-image`. Each channel and its effects are documented. We have chosen to enable auto-upgrade for `patch` versions.
The AKS upgrade process
During the upgrade process, AKS will:
- Add a new buffer node (or as many nodes as configured in max surge) to the cluster that runs the specified Kubernetes version.
- Cordon and drain one of the old nodes to minimize disruption to running applications. If you’re using max surge, it will cordon and drain as many nodes at the same time as the number of buffer nodes specified.
- When the old node is fully drained, it will be re-imaged to receive the new version, and it will become the buffer node for the following node to be upgraded.
- This process repeats until all nodes in the cluster have been upgraded.
- At the end of the process, the last buffer node will be deleted, maintaining the existing agent node count and zone balance.
Pod Disruption Budgets (PDBs)
At a high level, PDBs are easy to understand. You set them to tell Kubernetes that you want a minimum number of Pods available or a maximum number of Pods unavailable.
If we wanted a PDB that always made sure we had 2 Pods available for a particular deployment with a label matching `app=zookeeper`, then we could create that with the config below.
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: zookeeper
```
Say we had 3 Pods in our zookeeper deployment; K8s would know that it’s only allowed one voluntary disruption. If, for example, you were draining a node and two of the Pods belonging to our zookeeper deployment were running on that node, K8s would know to only delete and reschedule one Pod at a time.
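To see how K8s tracks this, here is roughly what the status section of that PDB would look like while all 3 zookeeper Pods are healthy. The values are illustrative, not real output; the live status can be viewed with `kubectl get pdb zk-pdb -o yaml`.

```yaml
# Illustrative status for the zk-pdb example above while all 3 zookeeper Pods are healthy.
status:
  currentHealthy: 3        # healthy Pods matching the selector
  desiredHealthy: 2        # derived from minAvailable in the spec
  disruptionsAllowed: 1    # 3 healthy - 2 required = 1 voluntary disruption permitted
  expectedPods: 3
```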
The problem
When an AKS node is drained, there are a few conditions that have to be met for the drain to succeed. One of those conditions is that none of the pods being drained from the node are associated with a Pod Disruption Budget (PDB) that has an `ALLOWED DISRUPTIONS` value of `0`. If there is a pod associated with a PDB with no allowed disruptions, the draining of the node will be blocked, and if nothing is done to change that, the drain action will time out and the upgrade will fail. It only takes a single PDB with `ALLOWED DISRUPTIONS` at `0` to stop a node from being drained. At the time of writing we had `783` pods deployed and `170` PDBs configured on the CFT Demo cluster, and `20` of those PDBs had an `ALLOWED DISRUPTIONS` value of `0`. So you can see how easily not having a “clean” cluster will stop auto-upgrades from happening.
Whilst working on a ticket, I realised that PDBs were blocking the auto-upgrades because of failed or pending deployments. Some of the environments weren’t too bad, but once I hit environments like Dev, ITHC and Demo, I had to do quite a lot of manual work to get the cluster into an upgradeable state.
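To make that concrete, this is an illustrative status for a PDB with `minAvailable: 2` whose deployment has a failed or pending Pod. The numbers are made up, not taken from one of our clusters.

```yaml
# Illustrative only: one of the three Pods is failed/pending, so the budget is exhausted.
status:
  currentHealthy: 1        # only one Pod matching the selector is healthy
  desiredHealthy: 2        # minAvailable from the spec
  disruptionsAllowed: 0    # no voluntary disruptions allowed, so a drain touching these Pods is blocked
  expectedPods: 3
```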
I’m certain that this issue with the PDBs will also have an effect when running cluster upgrades manually for the Major and Minor versions, although I haven’t tested that theory.
Solution 1
I believe we should remove PDBs in some environments and build an operator/tool that constantly checks the health of Deployments being targeted by PDBs. I’ve been unable to find an existing tool that deals with this scenario, and I think it may be difficult to find one, because the tool would also have to account for the fact that we use Flux (v1 & v2), which will try to overwrite any changes that are not made in code in our Flux repos. At a high level, the tool would:
- Identify any PDBs with an `ALLOWED DISRUPTIONS` value of `0`.
- Check the deployment to see if there are any pods in a `Ready` state.
- Suspend Flux syncing on the kustomization or HR from which the resources are being deployed.
- Nullify the PDB’s control by one of the following (not an exhaustive list):
  - Scale down the bad deployment to 0
  - Remove the PDB (sledgehammer option)
  - Update the PDB to 100% `maxUnavailable` or 0 `minAvailable` (probably the best option; see the example manifests after this list)
- Annotate the HRs, PDBs and/or Deployments so teams can identify why syncing is suspended and why resources have changed.
- Notify the correct team that their Flux deployment has been suspended and why.
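As a rough sketch of what the suspend, nullify and annotate steps could look like as manifests: the resource names, namespaces and annotation keys below are made up for illustration, and the Flux API version may differ depending on which Flux v2 release is installed.

```yaml
# Sketch only: resource names, namespaces and annotation keys are hypothetical.
# 1. Suspend Flux syncing so it does not revert the manual changes below.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: example-app                # hypothetical Kustomization owning the unhealthy deployment
  namespace: flux-system
  annotations:
    example.com/upgrade-reason: "Suspended while PDB blocks AKS auto-upgrade node drain"   # made-up annotation key
spec:
  suspend: true                    # stop Flux reconciling this Kustomization
  interval: 10m
  path: ./apps/example-app
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-config              # hypothetical source name
---
# 2. Neutralise the PDB by allowing 100% of its Pods to be unavailable.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-app-pdb            # hypothetical PDB name
  namespace: example-app
  annotations:
    example.com/upgrade-reason: "maxUnavailable set to 100% to unblock cluster upgrade"    # made-up annotation key
spec:
  maxUnavailable: "100%"           # replaces the original minAvailable/maxUnavailable value
  selector:
    matchLabels:
      app: example-app
```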
The split below is an example; the environments might be slightly different after a discussion.

| Env | PDBs Enabled | Operator/Tool |
|---------------|--------------|---------------|
| DEV/PREVIEW | NO | NO |
| TEST/PERFTEST | NO | YES |
| ITHC | YES | YES |
| DEMO | YES | YES |
| STAGING/AAT | YES | NO |
| PROD | YES | NO |
Solution 2
Don’t use PDBs in most environments and deal with any issues manually. We should probably have something in place that alerts when a PDB has an `ALLOWED DISRUPTIONS` value of `0`, but doesn’t make any changes to resources.
| Env | PDBs Enabled | Operator/Tool | Monitoring Tool |
|---------------|--------------|---------------|-----------------|
| DEV/PREVIEW | NO | NO | NO |
| TEST/PERFTEST | NO | NO | NO |
| ITHC | NO | NO | NO |
| DEMO | NO | NO | NO |
| STAGING/AAT | YES | NO | YES |
| PROD | YES | NO | YES |
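For the monitoring tool, one option is a Prometheus alert. This is a rough sketch that assumes kube-state-metrics is deployed and exposes the `kube_poddisruptionbudget_status_pod_disruptions_allowed` metric, and that the prometheus-operator `PrometheusRule` CRD is available; the rule name, duration and labels are illustrative choices.

```yaml
# Sketch only: assumes kube-state-metrics and the prometheus-operator CRDs are installed.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pdb-zero-disruptions       # hypothetical rule name
  namespace: monitoring
spec:
  groups:
    - name: pdb.rules
      rules:
        - alert: PDBZeroDisruptionsAllowed
          # Fires when a PDB has allowed no voluntary disruptions for 30 minutes,
          # which would block node drains and therefore AKS auto-upgrades.
          expr: kube_poddisruptionbudget_status_pod_disruptions_allowed == 0
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "PDB {{ $labels.namespace }}/{{ $labels.poddisruptionbudget }} allows 0 disruptions"
            description: "This PDB will block node drains and cluster upgrades until its workload is healthy again."
```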
Solution 3
Turn off auto-upgrades for patch versions. I think we will still have this issue when manually upgrading the cluster, but that needs to be proved. We can work around this issue by destroying and recreating clusters.