Around 12:30, I get Intray notifications saying that my services are unreachable. I browse to
https://cs-syd.eu and don't get a response. I try to ssh into the machine, and the command hangs.
Because I had a medical appointment at 13:00, I decided I would try to reboot the EC2 instance that hosts these services, and investigate when I got back.
After the services were rebooted, all services responded with
502: Bad Gateway.
Once I got back from the medical appointment, and still saw the
502 responses, I tried
ssh-ing into the machine again. This time I was able to get in, and run
The status showed
degraded, as expected. I noticed that
nginx was still running, which explained the
502 responses. I also noticed that none of my services were running.
intray-production-web-server service, for example, was restart-looping. I saw an error in the
journalctl logs that said
cannot execute binary file: Exec format error.
This is an error that occurs when you try to run a 64-bit executable on a 32-bit platform. I'd never seen this error on this machine before, and machines don't magically change their architecture, so I figured "that can't be it".
Next up, I ran
file on the executable, and saw (newlines added for clarity):
nix/store/0pb973llpgg42159g5x48bsrz8hw0gvw-intray-web-server-0.0.0.0/bin/intray-web-server: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /nix/store/yaz7pyf0ah88g2v505l38n0f3wg2vzdj-glibc-2.37-8/lib/ld-linux-x86-64.so.2, for GNU/Linux 3.10.0, stripped
This was definitely a 64-bit executable. I also tried to
ldd it, and saw:
$ ldd /nix/store/0pb973llpgg42159g5x48bsrz8hw0gvw-intray-web-server-0.0.0.0/bin/intray-web-server not a dynamic executable
This was extremely weird because I knew that this wasn't a statically linked executable. I double-checked it by copying that executable to my local machine, and indeed on my machine I saw:
~/debugging $ ldd intray-web-server linux-vdso.so.1 (0x00007ffe083e1000) libsqlite3.so.0 => /nix/store/64pyqgnjar1cdqg2l5kim7bngqpggx63-sqlite-3.41.2/lib/libsqlite3.so.0 (0x00007f391199b000) libpthread.so.0 => /nix/store/yaz7pyf0ah88g2v505l38n0f3wg2vzdj-glibc-2.37-8/lib/libpthread.so.0 (0x00007f3911996000) libz.so.1 => /nix/store/hr3m53r0nhyqx80sg0bz9xjgk6jg009k-zlib-1.2.13/lib/libz.so.1 (0x00007f3911977000) libgmp.so.10 => /nix/store/0ssnwyy41aynhav7jr4dz1y01lfzi86f-gmp-with-cxx-6.2.1/lib/libgmp.so.10 (0x00007f39118d7000) libc.so.6 => /nix/store/yaz7pyf0ah88g2v505l38n0f3wg2vzdj-glibc-2.37-8/lib/libc.so.6 (0x00007f39116ef000) libm.so.6 => /nix/store/yaz7pyf0ah88g2v505l38n0f3wg2vzdj-glibc-2.37-8/lib/libm.so.6 (0x00007f391160f000) librt.so.1 => /nix/store/yaz7pyf0ah88g2v505l38n0f3wg2vzdj-glibc-2.37-8/lib/librt.so.1 (0x00007f391160a000) libdl.so.2 => /nix/store/yaz7pyf0ah88g2v505l38n0f3wg2vzdj-glibc-2.37-8/lib/libdl.so.2 (0x00007f3911605000) libffi.so.8 => /nix/store/mnq0hqsqivdbaqzmzc287l0z9zw8dp15-libffi-3.4.4/lib/libffi.so.8 (0x00007f39115f8000) /nix/store/yaz7pyf0ah88g2v505l38n0f3wg2vzdj-glibc-2.37-8/lib/ld-linux-x86-64.so.2 => /nix/store/46m4xx889wlhsdj72j38fnlyyvvvvbyb-glibc-2.37-8/lib64/ld-linux-x86-64.so.2 (0x00007f3911aec000)
I looked at
journalctl as well, and saw this:
Failed to create cgroup /system.slice/intray-web-server-production.service: Cannot allocate memory
btop (which worked) showed that I was only using 25% of the machine's available RAM.
I also run
df -h and the disk was not full either:
$ df -h Filesystem Size Used Avail Use% Mounted on devtmpfs 100M 0 100M 0% /dev tmpfs 991M 0 991M 0% /dev/shm tmpfs 496M 3.3M 492M 1% /run tmpfs 991M 408K 990M 1% /run/wrappers /dev/disk/by-label/nixos 32G 14G 17G 45% / tmpfs 199M 0 199M 0% /run/user/0
At this point I thought that maybe my disk had been corrupted and the nix store might have been missing files or parts of files. However, the
intray-web-server file I got from the server matched the
intray-web-server file that I got from my own nix store.
I also tried to
nix copy all the relevant store paths to the machine again. I saw the disk usage increasing, so I thought that might have been the solution, but in the end it did not change anything.
Architecture rabbit hole
I decided to investigate the "wrong architecture" idea again.
I first tried
nix-info and was surprised to see:
This seriously surprised me because my
system = "x86_64-linux" and that had worked before. I tried figuring out where
nix-info got this information but could not figure it out easily. Using
strace did not show me any files that it'd be reading this information from. (In hindsight I now know that programs can ask this information from the kernel directly, but I hadn't thought to look for that in the
I was still convinced that this information was wrong, so I tried looking up the processor info in
processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 79 model name : Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
I looked up this chip and saw that it was indeed a 64-bit processor:
It turns out that you can run a 32-bit kernel on a 64-bit processor. But that would mean that I was running a 32-bit kernel and I'd configured
system to be
I looked into
/run/booted-system/system and, to my great surprise, found:
I somehow managed to deploy a 32-bit system to a 64-bit machine, without noticing, and only find out an unknown amount of time later.
I double-checked that I was building a 64-bit system in my flake:
$ nix build .#nixosConfigurations.default.config.system.build.toplevel $ cat result/system x86_64-linux
At this point I was considering that maybe the kernel had magically changed size. (I thought I was going a bit crazy, to be honest.) But, as my favourite engineer likes to remind me:
There is no magic, you can always go figure it out.
So I continued to investigate.
Deploying with NixOps
I tried to deploy my machine again, but this time without any of the services:
Deployment succeeded, so I celebrated! Then I rebooted the instance and checked
/run/current-system/system and saw:
How could this be? I had just built an
x86_64-linux system, but by the time I'd deployed it, it had turned into an
I hypothesised that maybe the automation that changes the boot sequence was built for a 64-bit system and would therefore fail to change the boot sequence when run on a 32-bit system.
To rule this out, I tried manually adding an option to
/boot/grub/grub.cfg based on the system that I had
nix copy-ed over to the machine manually. When I rebooted now, the machine booted into a 64-bit system, yay! Indeed,
Except after I redeployed again with
i686-linux again. This meant I could exclude the "boot sequence automation" hypothesis.
It also meant that I had my next lead:
nixops. I tried running
nixops deploy with
--build-only to see what it built, and to my great surprise it built something different than what was in my flake.
nixops deploy deployed a 32-bit system eventhough I had clearly specified
system = "x86_64-linux". I tried specifying
system in different places; The
pgks that NixOps came from, the deployement, and the machine specification itself. None of these options had any impact on whether NixOps would build a 32-bit system.
At this point I spent a bit more time trying to get
nixops to build a 64-bit system, but in the end I realised that I could "just" use
nixos-rebuild to fix the issue:
$ nixos-rebuild switch --flake .#default --target-host root@instance --build-host localhost
In hindsight I realised that the original "wrong architecture" error made a lot of sense because I was running a 32-bit kernel.
It also made sense
nix-info disagreed because I was running the wrong kernel for the system.
It made a lot of sense that the instance did not switch into the system that I wasn't deploying.
I theorise that, with the switch from NixOps 1 to NixOps 2, I'd been deploying a 32-bit kernel but still running a 64-bit kernel because I'd never rebooted the machine. This would explain why the problem never showed up until I rebooted the instance.
I still don't understand why
ldd showed the binary to be a static executable. Maybe because a 32-bit
ldd can't investigate 64-bit executables but just silently shows an incorrect result?
I still don't know why
nixops could only build a 32-bit executable. For the record, this is
nixopsUnstable (NixOps 2), so I guess I should have expected things to break. I just hadn't expected that things would break months later and behind my back.
What went smoothly
My monitoring noticed the problem within minutes, so I am very happy about how quickly that went.
nixos-rebuild switch --target-hostwent extremely smoothly. The command built the right machine, deployed it to the right place, put the right grub configuration in place, and all services came up again.
What could have gone more smoothly
Investigate what the error tells you first, not last.
The error was pretty clearly telling me that there was an architecture issue, but I dismissed it as "that's probably not it". Instead I could have checked excluded that possibility first and I'd have found the problem sooner.
Fix the issue first, especially if it's faster, then fix the cause.
Once I found out the problem was NixOps building the wrong system, I could have immediately deployed using an alternative method like
nixos-rebuildinstead of trying to fix my NixOps configuration. This would have saved about an hour of downtime.
Closely monitor why the machine locked up. The original problem is probably still there and I still have no idea what it was.
Switch away from
nixops. I will probably deploy infrastructure with
terraformand software with
nixos-rebuildor something similarly simple. I used NixOps because it can do infrastructure and software deployments together, but given that I not only have static infrastructure, but specifically only 1 server, it makes no sense to take on this extra complexity.
Consider more frequently rebooting the machine to find this kind of problem earlier.
12:30: Alarms go off saying that cs-syd.eu is unreachable.
12:45: Instance reboot.
13:00: I have a medical appointment.
14:30: I get back from my medical appointment and start investigating.
18:00: I deploy with an alternative deployment tool; all services are up again.
18:30: Informed a NixOps maintainer.
20:35: A NixOps maintainer submits a PR with a workaround
I owe big thanks to nh2, who spent at least two hours in a call with me to help me debug it. The outage would have taken many hours longer to fix without him.