
Archinstall and NVMe issues.

This article addresses #1495, which at first glance was an issue of partitions/targets being mounted twice.
While that was one of the symptoms, other symptoms (#1503, #1506, #1516, #1522) started popping up after the release of v2.5.1.

Timeline

Reproducing the issue

Given that the initial report lacked an install.log, the issue turned into a bit of a guessing game. And since the issue was posted well before the release of the Arch Linux monthly ISO, it was thought to be a one-off and was put a bit on the back burner (this is common when an error report lacks details and it coincides with tagging a new archinstall release).
Nevertheless, debugging began 8 hours after the initial report, focusing on partitions already being mounted and how that could happen.
The angle of attack was to run a few installations and see if the issue could be triggered.
After some 20 installations in an isolated QEMU environment it became apparent that this was not going to work.

An educated guess was therefore made and an initial fix was produced. The belief was that updating the Partition() object's cached information about itself after a .mount() call would solve the issue.
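
For illustration, the idea behind that initial fix could look roughly like the sketch below. This is not archinstall's actual code; the Partition class, its fields and the lsblk columns used here are hypothetical stand-ins.

import json
import subprocess

class Partition:
	def __init__(self, path: str):
		self.path = path
		self.info = self._read_info()  # cached metadata about the partition

	def _read_info(self) -> dict:
		# Ask lsblk for fresh metadata instead of trusting whatever was cached earlier.
		output = subprocess.check_output(
			['lsblk', '--json', '-b', '-o', '+PARTUUID,UUID,FSTYPE', self.path]
		)
		return json.loads(output)['blockdevices'][0]

	def mount(self, target: str) -> None:
		subprocess.check_call(['mount', self.path, target])
		# The assumed fix: refresh the cached information right after mounting.
		self.info = self._read_info()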
Due to life circumstances, and in preparation for the wave of issues that usually comes after the Arch Linux ISO containing the new version is officially released, debugging was slower than usual and the issue was thought to be resolved.
On 2022-10-04 (6 days after the first issue and 4 days after the official ISO was released) the first issue containing an error log appeared. And this is where things got strange.

It became apparent that there was no productive way for us to reproduce the issue, yet the issue was clearly real. And at this point it wasn't isolated to mounts being done twice.
Luckily, on 2022-10-05 a community member by the name of Betablockr (GitHub) stepped in, as he could reliably reproduce it with every installation attempt and had some free time to spare. Communication was done over Discord and narrowing down the issue began.

Key findings

At first, there was nothing sticking out in the installation logs (in hindsight we overlooked the Partition() object staring us in the face all along). But Betablockr (GitHub) and a Discord user by the name of vegancookies concluded that it had to be something related to NVMe drives.
And more reports started coming in, all sharing the same trait: NVMe (discord) drives were the common thread.

Luckily Torxed (GitHub) had several NVMe setups ready to go. The bad news was that none of them could recreate the issue.

It wasn't until the 11th day that the final key finding was made:

Formatting /dev/nvme0n1p1 -> vfat
...
Formatting /dev/nvme0n1p1 -> ext4
And checking the mount statements, a similar issue could be seen:

Mounting Partition(path=/dev/nvme0n1p1, size=0.5, PARTUUID=f451088e-c21d-4067-94dc-1a898a697062, fs=ext4) to /mnt/archinstall/
Mounting Partition(path=/dev/nvme0n1p1, size=0.5, PARTUUID=f451088e-c21d-4067-94dc-1a898a697062, fs=vfat) to /mnt/archinstall/boot

And this is the thing we overlooked in the original screenshot.
It did not look like an error, as the lines followed the usual output format.
But looking at the Partition() objects for the mountpoints / and /boot, it was clear that both shared the same PARTUUID and device path, which shouldn't be possible.
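
As a quick way to confirm a finding like this, one can list the PARTUUID of every partition on the disk and check that they are unique. This is just an illustrative check, not part of archinstall; /dev/nvme0n1 is an example device.

import json
import subprocess

out = subprocess.check_output(['lsblk', '--json', '-o', 'PATH,PARTUUID', '/dev/nvme0n1'])
partitions = json.loads(out)['blockdevices'][0].get('children', [])

# Every partition should report its own PARTUUID; duplicates indicate stale data.
for part in partitions:
	print(part['path'], part['partuuid'])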

Workarounds

The first workaround was to add a bit of code that would unmount the drives before what we assumed was an attempt to re-mount the same partition twice.
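
In spirit, that workaround amounts to something like the following hypothetical helper (not the actual patch): make sure nothing is mounted at the target before mounting.

import subprocess

def mount_clean(device_path: str, target: str) -> None:
	# Recursively unmount anything already sitting at the target; ignore
	# the error if nothing was mounted there to begin with.
	subprocess.run(['umount', '-R', target], check=False)
	subprocess.run(['mount', device_path, target], check=True)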

The second (and proper) workaround was to:

These were supplied because disk issues have historically taken a lot of time to diagnose, correct and package into a proper release. With the workarounds out there, what was left was..

Solving the problem

Solving issues like this has historically been done with a time.sleep() in strategic places, as this is a common way to deal with disk operations where certain tasks have to wait for cached data to update.
Avoiding that was important; instead we wanted to wait the smallest amount of time possible while still ensuring the data we need is in the cache.

Finding the change that broke it was hard, as there were over 6 881 changes and some two thousand deletions. And at the time of creating a fix, fueling a bit of the confusion were reports that "setting a root password first cured the issue" or that the issue was related to other hardware as well. But one important code block that had been removed in favor of optimization between the versions was:

for i in range(storage['DISK_RETRY_ATTEMPTS']):
	# Re-read the partition information until all required fields show up.
	lsblk_info = SysCommand(f"lsblk --json -b -o+LOG-SEC,SIZE,PTTYPE,PARTUUID,UUID,FSTYPE {self.device_path}").decode('UTF-8')
	...
	if all(requirements):
		break

	# Nudge the kernel to re-read the partition table, then back off a little
	# longer on every attempt instead of sleeping one long, fixed amount.
	self.partition_object.partprobe()
	time.sleep(max(0.1, storage['DISK_TIMEOUTS'] * i))

We stuck with this, and reintroducing it where it had been removed proved challenging, as the overall disk operation logic had changed significantly.
But the code was migrated into the new data model for partition information.
This will re-introduce a slight delay every time disk information is re-created, but the logic should sleep for the least possible time while still giving the kernel enough time to update its cache between partprobe calls.
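
For reference, a standalone version of that retry idea, using plain subprocess calls instead of archinstall's internal SysCommand and storage helpers, could look like the sketch below. The retry count, timeout and required fields are illustrative.

import json
import subprocess
import time

DISK_RETRY_ATTEMPTS = 5
DISK_TIMEOUTS = 1  # base delay in seconds, scaled per attempt

def wait_for_partition_info(device_path: str) -> dict:
	for attempt in range(DISK_RETRY_ATTEMPTS):
		output = subprocess.check_output(
			['lsblk', '--json', '-b', '-o', '+LOG-SEC,SIZE,PTTYPE,PARTUUID,UUID,FSTYPE', device_path]
		)
		info = json.loads(output)['blockdevices'][0]

		# Stop as soon as the kernel exposes the fields we rely on.
		if info.get('partuuid') and info.get('fstype'):
			return info

		# Ask the kernel to re-read partition tables, then sleep a bit longer
		# on each attempt instead of one long fixed delay.
		subprocess.run(['partprobe'], check=False)
		time.sleep(max(0.1, DISK_TIMEOUTS * attempt))

	raise TimeoutError(f"{device_path} never exposed the expected partition information")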

Lessons learned