All CS SYD services (Intray, Tickler, Smos, Social Dance Today, ...) were down on 2023-08-08 between 12:30 and 18:00 Zürich time. This is a postmortem of the outage.
Full story
Symptoms
Around 12:30, I got Intray notifications saying that my services were unreachable. I browsed to https://cs-syd.eu and did not get a response. I tried to ssh into the machine, and the command hung.
Because I had a medical appointment at 13:00, I decided I would try to reboot the EC2 instance that hosts these services, and investigate when I got back.
After the reboot, all services responded with 502: Bad Gateway.
Investigation
Once I got back from the medical appointment, and still saw the 502 responses, I tried ssh-ing into the machine again. This time I was able to get in and run systemctl status.
The status showed degraded, as expected. I noticed that nginx was still running, which explained the 502 responses. I also noticed that none of my services were running.
Indeed the intray-production-web-server service, for example, was restart-looping. I saw an error in the journalctl logs that said cannot execute binary file: Exec format error.
This is an error that occurs when you try to run a 64-bit executable on a 32-bit platform. I'd never seen this error on this machine before, and machines don't magically change their architecture, so I figured "that can't be it".
Next up, I ran file on the executable, and saw (newlines added for clarity):
/nix/store/0pb973llpgg42159g5x48bsrz8hw0gvw-intray-web-server-0.0.0.0/bin/intray-web-server:
ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked,
interpreter /nix/store/yaz7pyf0ah88g2v505l38n0f3wg2vzdj-glibc-2.37-8/lib/ld-linux-x86-64.so.2,
for GNU/Linux 3.10.0, stripped
This was definitely a 64-bit executable. I also tried to ldd it, and saw:
$ ldd /nix/store/0pb973llpgg42159g5x48bsrz8hw0gvw-intray-web-server-0.0.0.0/bin/intray-web-server
        not a dynamic executable
This was extremely weird because I knew that this wasn't a statically linked executable. I double-checked it by copying that executable to my local machine, and indeed on my machine I saw:
~/debugging $ ldd intray-web-server
        linux-vdso.so.1 (0x00007ffe083e1000)
        libsqlite3.so.0 => /nix/store/64pyqgnjar1cdqg2l5kim7bngqpggx63-sqlite-3.41.2/lib/libsqlite3.so.0 (0x00007f391199b000)
        libpthread.so.0 => /nix/store/yaz7pyf0ah88g2v505l38n0f3wg2vzdj-glibc-2.37-8/lib/libpthread.so.0 (0x00007f3911996000)
        libz.so.1 => /nix/store/hr3m53r0nhyqx80sg0bz9xjgk6jg009k-zlib-1.2.13/lib/libz.so.1 (0x00007f3911977000)
        libgmp.so.10 => /nix/store/0ssnwyy41aynhav7jr4dz1y01lfzi86f-gmp-with-cxx-6.2.1/lib/libgmp.so.10 (0x00007f39118d7000)
        libc.so.6 => /nix/store/yaz7pyf0ah88g2v505l38n0f3wg2vzdj-glibc-2.37-8/lib/libc.so.6 (0x00007f39116ef000)
        libm.so.6 => /nix/store/yaz7pyf0ah88g2v505l38n0f3wg2vzdj-glibc-2.37-8/lib/libm.so.6 (0x00007f391160f000)
        librt.so.1 => /nix/store/yaz7pyf0ah88g2v505l38n0f3wg2vzdj-glibc-2.37-8/lib/librt.so.1 (0x00007f391160a000)
        libdl.so.2 => /nix/store/yaz7pyf0ah88g2v505l38n0f3wg2vzdj-glibc-2.37-8/lib/libdl.so.2 (0x00007f3911605000)
        libffi.so.8 => /nix/store/mnq0hqsqivdbaqzmzc287l0z9zw8dp15-libffi-3.4.4/lib/libffi.so.8 (0x00007f39115f8000)
        /nix/store/yaz7pyf0ah88g2v505l38n0f3wg2vzdj-glibc-2.37-8/lib/ld-linux-x86-64.so.2 => /nix/store/46m4xx889wlhsdj72j38fnlyyvvvvbyb-glibc-2.37-8/lib64/ld-linux-x86-64.so.2 (0x00007f3911aec000)
I looked at journalctl as well, and saw this:
Failed to create cgroup /system.slice/intray-web-server-production.service: Cannot allocate memory
However btop (which worked) showed that I was only using 25% of the machine's available RAM.
I also ran df -h and the disk was not full either:
$ df -h
Filesystem                Size  Used Avail Use% Mounted on
devtmpfs                  100M     0  100M   0% /dev
tmpfs                     991M     0  991M   0% /dev/shm
tmpfs                     496M  3.3M  492M   1% /run
tmpfs                     991M  408K  990M   1% /run/wrappers
/dev/disk/by-label/nixos   32G   14G   17G  45% /
tmpfs                     199M     0  199M   0% /run/user/0
At this point I thought that maybe my disk had been corrupted and the nix store might have been missing files or parts of files. However, the intray-web-server file I got from the server matched the intray-web-server file that I got from my own nix store.
I also tried to nix copy all the relevant store paths to the machine again. I saw the disk usage increasing, so I thought that might have been the solution, but in the end it did not change anything.
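For reference, copying a store path to the machine looks roughly like this (using the intray-web-server path from above; root@instance is the same placeholder host as in the nixos-rebuild command later in this post):

$ nix copy --to ssh://root@instance /nix/store/0pb973llpgg42159g5x48bsrz8hw0gvw-intray-web-server-0.0.0.0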
Architecture rabbit hole
I decided to investigate the "wrong architecture" idea again.
I first tried nix-info and was surprised to see:
system: 'i686-linux'
This seriously surprised me because my flake.nix said system = "x86_64-linux" and that had worked before. I tried figuring out where nix-info got this information, but could not figure it out easily. Using strace did not show me any files that it'd be reading this information from. (In hindsight I now know that programs can ask the kernel for this information directly, but I hadn't thought to look for that in the strace output.)
I was still convinced that this information was wrong, so I tried looking up the processor info in /proc/cpuinfo:
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 79
model name      : Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
I looked up this chip and saw that it was indeed a 64-bit processor.
How could /proc/cpuinfo and nix-info disagree?!
It turns out that you can run a 32-bit kernel on a 64-bit processor. But that would mean that I was running a 32-bit kernel even though I'd configured system to be x86_64-linux?!
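The distinction is easy to check from a shell: uname -m reports the architecture of the running kernel, while the lm ("long mode") flag in /proc/cpuinfo says whether the CPU itself is 64-bit capable. On this machine the two would have looked something like this (outputs reconstructed, not copied from the instance):

$ uname -m                               # architecture of the running kernel
i686
$ grep -wo lm /proc/cpuinfo | sort -u    # 'lm' = long mode: the CPU can run 64-bit code
lm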
I looked into /run/booted-system/system and, to my great surprise, found:
i686-linux
I somehow managed to deploy a 32-bit system to a 64-bit machine without noticing, and only found out an unknown amount of time later.
I double-checked that I was building a 64-bit system in my flake:
$ nix build .#nixosConfigurations.default.config.system.build.toplevel
$ cat result/system
x86_64-linux
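For context, the flake output being built here looks roughly like this; the nixpkgs branch and module list are illustrative rather than my exact configuration:

{
  inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-23.05";

  outputs = { self, nixpkgs }: {
    nixosConfigurations.default = nixpkgs.lib.nixosSystem {
      # The architecture I had specified and expected to be deployed.
      system = "x86_64-linux";
      modules = [ ./configuration.nix ];
    };
  };
}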
At this point I was considering that maybe the kernel had magically changed size. (I thought I was going a bit crazy, to be honest.) But, as my favourite engineer likes to remind me:
There is no magic, you can always go figure it out.
So I continued to investigate.
Deploying with NixOps
I tried to deploy my machine again, but this time without any of the services:
nixops deploy
Deployment succeeded, so I celebrated! Then I rebooted the instance and checked /run/current-system/system and saw:
i686-linux
How could this be? I had just built an x86_64-linux system, but by the time I'd deployed it, it had turned into an i686-linux system.
I hypothesised that maybe the automation that changes the boot sequence was built for a 64-bit system and would therefore fail to change the boot sequence when run on a 32-bit system.
To rule this out, I tried manually adding an option to /boot/grub/grub.cfg based on the system that I had nix copy-ed over to the machine manually. When I rebooted now, the machine booted into a 64-bit system, yay! Indeed, nix-info showed x86_64-linux.
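For the curious: a hand-written NixOS grub entry has roughly this shape; the store paths below are placeholders, not the actual ones from the machine:

menuentry "NixOS (manually added 64-bit system)" {
  linux /nix/store/<hash>-linux-<version>/bzImage init=/nix/store/<hash>-nixos-system-<name>/init
  initrd /nix/store/<hash>-initrd-linux-<version>/initrd
}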
Except that after I redeployed with nixops deploy, nix-info showed i686-linux again. This meant I could exclude the "boot sequence automation" hypothesis.
It also meant that I had my next lead: nixops. I tried running nixops deploy with --build-only to see what it built, and to my great surprise it built something different from what was in my flake.
Indeed nixops deploy deployed a 32-bit system even though I had clearly specified system = "x86_64-linux". I tried specifying system in different places: the pkgs that NixOps came from, the deployment, and the machine specification itself. None of these options had any impact on whether NixOps would build a 32-bit system.
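As an illustration, this is the kind of place where I expected it to matter: the machine specification itself. deployment.targetHost is a NixOps option and nixpkgs.localSystem.system is the usual NixOS way to pin the architecture; the surrounding names here are only illustrative, not my exact configuration:

{
  instance = { config, pkgs, ... }: {
    deployment.targetHost = "instance";           # where NixOps deploys to
    nixpkgs.localSystem.system = "x86_64-linux";  # architecture the system should be built for
  };
}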
At this point I spent a bit more time trying to get nixops to build a 64-bit system, but in the end I realised that I could "just" use nixos-rebuild to fix the issue:
$ nixos-rebuild switch --flake .#default --target-host root@instance --build-host localhost
Hindsight
In hindsight I realised that the original "wrong architecture" error made a lot of sense because I was running a 32-bit kernel.
It also made sense that /proc/cpuinfo and nix-info disagreed, because I was running the wrong kernel for the hardware.
It also made a lot of sense that the instance did not switch into the 64-bit system: that was not the system that was actually being deployed.
I theorise that, with the switch from NixOps 1 to NixOps 2, I'd been deploying a 32-bit kernel but still running a 64-bit kernel because I'd never rebooted the machine. This would explain why the problem never showed up until I rebooted the instance.
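This kind of drift is detectable without waiting for a reboot: on NixOS, /run/booted-system points at the system the machine booted into and /run/current-system at the one most recently activated. A check along these lines (a sketch, not something I had in place) would have flagged the mismatch long before the outage:

# Warn when the running system's architecture differs from the last deployed one.
booted="$(cat /run/booted-system/system)"
current="$(cat /run/current-system/system)"
if [ "$booted" != "$current" ]; then
  echo "booted system ($booted) differs from current system ($current)"
fi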
Remaining unknowns
I still don't understand why ldd showed the binary to be a static executable. Maybe because a 32-bit ldd can't investigate 64-bit executables but just silently shows an incorrect result?
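If that guess is right (glibc's ldd works by actually running the dynamic loader on the binary), then a tool that only parses the ELF header, such as readelf, is the more reliable check when kernel and binary architectures might not match; roughly:

$ readelf -h intray-web-server | grep -E 'Class|Machine'
  Class:                             ELF64
  Machine:                           Advanced Micro Devices X86-64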
I still don't know why nixops could only build a 32-bit executable. For the record, this is nixopsUnstable (NixOps 2), so I guess I should have expected things to break. I just hadn't expected that things would break months later and behind my back.
Review
What went smoothly
My monitoring noticed the problem within minutes, so I am very happy about how quickly that went.
Deploying with nixos-rebuild switch --target-host went extremely smoothly. The command built the right machine, deployed it to the right place, put the right grub configuration in place, and all services came up again.
What could have gone more smoothly
Investigate what the error tells you first, not last.
The error was pretty clearly telling me that there was an architecture issue, but I dismissed it as "that's probably not it". Instead I could have excluded that possibility first, and I'd have found the problem sooner.
Fix the issue first, especially if it's faster, then fix the cause.
Once I found out the problem was NixOps building the wrong system, I could have immediately deployed using an alternative method like nixos-rebuild instead of trying to fix my NixOps configuration. This would have saved about an hour of downtime.
Next steps
Closely monitor the machine to find out why it locked up. The original problem is probably still there and I still have no idea what it was.
Switch away from nixops. I will probably deploy infrastructure with terraform and software with nixos-rebuild or something similarly simple. I used NixOps because it can do infrastructure and software deployments together, but given that I not only have static infrastructure, but specifically only one server, it makes no sense to take on this extra complexity.
Consider more frequently rebooting the machine to find this kind of problem earlier.
Timeline
12:30: Alarms go off saying that cs-syd.eu is unreachable.
12:45: Instance reboot.
13:00: I have a medical appointment.
14:30: I get back from my medical appointment and start investigating.
18:00: I deploy with an alternative deployment tool; all services are up again.
18:30: Informed a NixOps maintainer.
20:35: A NixOps maintainer submits a PR with a workaround.
Acknowledgments
I owe big thanks to nh2, who spent at least two hours in a call with me to help me debug it. The outage would have taken many hours longer to fix without him.