Post Mortem of the 7th July 2023 Jenkins Infrastructure Outage
On Friday 7th of July 2023, the Jenkins infrastructure suffered a major outage from 11:05 UTC until 15:25 UTC, with complete downtime of the following public services:
- accounts.jenkins.io
- fallback.get.jenkins.io
- get.jenkins.io
- incrementals.jenkins.io
- javadoc.jenkins.io
- plugin-health.jenkins.io
- plugin-site-issues.jenkins.io
- plugins.origin.jenkins.io
- plugins.jenkins.io
- rating.jenkins.io
- repo.azure.jenkins.io
- reports.jenkins.io
- stories.jenkins.io
- uplink.jenkins.io
- weekly.ci.jenkins.io
- www.origin.jenkins.io
We also had complete downtime of the following non-public services:
- ldap.jenkins.io
- previews of *.jenkins.io pull requests (infra.ci.jenkins.io)
In addition, there was disruption (partial or complete) of the following services:
- ci.jenkins.io
- infra.ci.jenkins.io
- issues.jenkins.io
- plugins.jenkins.io
- repo.jenkins-ci.org
- www.jenkins.io
The public IPs of these services have changed (DNS records included).
⚠️ Update your corporate networks (DNS, proxies, firewall) if you need to access these services.
Incident Timeline
- 10:30 UTC: After a successful upgrade of the public Kubernetes cluster in Azure to 1.25 (as part of help desk ticket 3582), we realized that the LDAP service was not reachable by the services running inside the cluster (such as accounts.jenkins.io). We quickly identified IP restrictions blocking these requests, as the pods' originating IPs were in a different range than before.
- 10:55 UTC: The fix (Azure PR 431) was deployed to specify a proper set of IP ranges for the pods. It removed all of the node pools (all the virtual machines where the containers were running) and failed to re-create them, causing a full outage of all the services running in this cluster:
  - accounts.jenkins.io
  - get.jenkins.io
  - incrementals.jenkins.io
  - javadoc.jenkins.io
  - jenkins-wiki-exporter.jenkins.io
  - ldap.jenkins.io
  - plugins.jenkins.io
  - previews of *.jenkins.io pull requests (infra.ci.jenkins.io)
  - release.ci.jenkins.io
  - repo.azure.jenkins.io
  - reports.jenkins.io
  - stories.jenkins.io
  - uplink.jenkins.io
  - www.jenkins.io
- 11:16 until 13:16 UTC: The failure to re-create these resources led us to spend the next 2 hours re-creating the cluster from scratch with a fixed network setup.
- 15:17 UTC: This pull request was pushed to persist the manual work we did to re-create the cluster, including the IP setup.
- 15:25 UTC: All services were back to normal.
What Happened?
- When the cluster was initially created, we selected the `10.244.0.0/16` virtual network IP range (ref. Azure VNets) with a `10.245.0.0/24` sub-network (ref. Azure subnet).
- But we ignored that the `10.244.0.0/24` range is the default CIDR for the Kubernetes Pods network in Azure when using the "kubenet" network plugin (which we use to support IPv6) instead of the default CNI.
- The node pool re-creation failed because we assumed both ranges were able to communicate and tried to deploy an invalid setup.
- As soon as we specified a custom Pod CIDR in a distinct range, everything went fine (see the sketch after this list).
- When the original cluster was deleted, it transitively removed the current public IPs, as it removed the node resource group containing them.
- These public IPs should change as little as possible to avoid problems for our users running behind a corporate firewall with an allow-list.
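As an illustration of the fix, here is a minimal Terraform sketch of an AKS cluster using kubenet with an explicitly declared Pod CIDR. All names and IP ranges are hypothetical and only echo the incident; they are not the actual jenkins-infra configuration.

```hcl
# Hypothetical names and IP ranges, for illustration only.
resource "azurerm_resource_group" "publick8s" {
  name     = "publick8s"
  location = "East US 2"
}

resource "azurerm_virtual_network" "publick8s" {
  name                = "publick8s-vnet"
  resource_group_name = azurerm_resource_group.publick8s.name
  location            = azurerm_resource_group.publick8s.location
  address_space       = ["10.244.0.0/16"]
}

resource "azurerm_subnet" "publick8s" {
  name                 = "publick8s-nodes"
  resource_group_name  = azurerm_resource_group.publick8s.name
  virtual_network_name = azurerm_virtual_network.publick8s.name
  address_prefixes     = ["10.244.1.0/24"]
}

resource "azurerm_kubernetes_cluster" "publick8s" {
  name                = "publick8s"
  location            = azurerm_resource_group.publick8s.location
  resource_group_name = azurerm_resource_group.publick8s.name
  dns_prefix          = "publick8s"

  default_node_pool {
    name           = "systempool"
    vm_size        = "Standard_D4s_v3"
    node_count     = 3
    vnet_subnet_id = azurerm_subnet.publick8s.id
  }

  identity {
    type = "SystemAssigned"
  }

  network_profile {
    network_plugin = "kubenet"
    # Without an explicit pod_cidr, kubenet picks a default inside 10.244.0.0/16,
    # which overlaps with the virtual network above. Declaring a distinct range
    # keeps the node and pod networks from colliding.
    pod_cidr       = "10.100.0.0/16"
    service_cidr   = "10.101.0.0/16"
    dns_service_ip = "10.101.0.10"
  }
}
```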
What can we do to improve?
- As per our initial assessment: protect the public IPs from deletion by adding a Management Lock (see the first sketch after this list).
- As recommended by other contributors: store the public IPs in a distinct resource group and set up the Kubernetes-managed load balancers accordingly with the annotation `service.beta.kubernetes.io/azure-load-balancer-resource-group` (see the second sketch after this list).
- Improve our network diagrams and documentation so that we have a better view of the IP ranges and their potential overlaps when preparing operations.
- Avoid changing all AKS node pool configurations at once: we would have caught the issue after the first node pool and could have avoided a full outage (we are working on this topic for the `arm64` node pools in PR-3623).
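To illustrate the first recommendation, here is a minimal Terraform sketch (with hypothetical names) of a public IP kept in its own resource group and protected by a `CanNotDelete` management lock:

```hcl
# Hypothetical names: the real definitions live in the jenkins-infra Terraform repositories.
resource "azurerm_public_ip" "publick8s" {
  name                = "publick8s-ipv4"
  resource_group_name = "publick8s-public-ips"
  location            = "East US 2"
  allocation_method   = "Static"
  sku                 = "Standard"
}

# A "CanNotDelete" management lock: deleting the cluster (or its node resource group)
# can no longer take this IP with it unless the lock is removed first.
resource "azurerm_management_lock" "publick8s_ip" {
  name       = "publick8s-ipv4-lock"
  scope      = azurerm_public_ip.publick8s.id
  lock_level = "CanNotDelete"
  notes      = "Allow-listed by users behind corporate firewalls, do not delete."
}
```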
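For the second recommendation, the annotation lives on the Kubernetes Service objects exposed through the Azure load balancer. Here is a sketch using the Terraform Kubernetes provider, reusing the hypothetical resource group from the previous sketch (the service name, namespace, selector, and IP are also hypothetical); the same annotation could equally be set in a Helm values file:

```hcl
# Hypothetical Service pointing Azure at a public IP held in a distinct resource group.
resource "kubernetes_service" "public_ingress" {
  metadata {
    name      = "public-ingress"
    namespace = "public"
    annotations = {
      # Tell the Azure cloud controller to look for the load balancer public IP
      # in this resource group instead of the AKS-managed node resource group.
      "service.beta.kubernetes.io/azure-load-balancer-resource-group" = "publick8s-public-ips"
    }
  }

  spec {
    type             = "LoadBalancer"
    load_balancer_ip = "20.0.0.10" # the pre-existing static public IP (illustrative)
    selector = {
      "app.kubernetes.io/name" = "public-ingress"
    }
    port {
      port        = 443
      target_port = 8443
    }
  }
}
```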
From 0 to production in less than 4 hours!
One of the takeaways of this outage is that we are able to recover from a full destruction of the cluster hosting almost all of our public services in less than 4 hours.
A huge collaborative effort made this possible: defining the architecture, building the infrastructure, backing up its data, and more.
This effort was started years ago by R. Tyler Croy and Olivier Vernin, and has been backed by many contributors such as Daniel Beck, Hervé Le Meur, Tim Jacomb, Mark E Waite, Stéphane Merle and many more.
As the current Infrastructure Officer, I want to thank them all for making our lives easier when catastrophic events happen!