Tag Archives: Projects

A re-introduction to mkosi — A Tool for Generating OS Images

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/a-re-introduction-to-mkosi-a-tool-for-generating-os-images.html

This is a guest post written by Daan De Meyer, systemd and mkosi
maintainer

Almost 7 years ago, Lennart first wrote about mkosi on this blog. Some
years ago, I took over development and there’s been a huge amount of
changes and improvements since then. So I figure this is a good time to
re-introduce mkosi.

mkosi stands for Make Operating System Image. It generates OS images
that can be used for a variety of purposes.

If you prefer watching a video over reading a blog post, you can also
watch my presentation on
mkosi at All Systems Go 2023.

What is mkosi?

mkosi was originally written as a tool to simplify hacking on systemd
and for experimenting with images using many of the new concepts being
introduced in systemd at the time. In the meantime, it has evolved into
a general purpose image builder that can be used in a multitude of
scenarios.

Instructions to install mkosi can be found in its
readme. We
recommend running the latest version to take advantage of all the latest
features and bug fixes. You’ll also need bubblewrap and the package
manager of your favorite distribution to get started.

At its core, the workflow of mkosi can be divided into 3 steps:

  1. Generate an OS tree for some distribution by installing a set of
    packages.
  2. Package up that OS tree in a variety of output formats.
  3. (Optionally) Boot the resulting image in qemu or systemd-nspawn.

Images can be built for any of the following distributions:

  • Fedora Linux
  • Ubuntu
  • OpenSUSE
  • Debian
  • Arch Linux
  • CentOS Stream
  • RHEL
  • Rocky Linux
  • Alma Linux

And the following output formats are supported:

  • GPT disk images built with systemd-repart
  • Tar archives
  • CPIO archives (for building initramfs images)
  • USIs (Unified System Images which are full OS images packed in a UKI)
  • Sysext, confext and portable images
  • Directory trees

For example, to build an Arch Linux GPT disk image and boot it in
qemu, you can run the following command:

$ mkosi -d arch -p systemd -p udev -p linux -t disk qemu

To instead boot the image in systemd-nspawn, replace qemu with boot:

$ mkosi -d arch -p systemd -p udev -p linux -t disk boot

The resulting image can be found in the current working directory,
named image.raw. However, using a separate output directory is
recommended, which is as simple as running mkdir mkosi.output.

To rebuild the image after it’s already been built once, add -f to the
command line before the verb. Any arguments passed after the verb are
forwarded to either systemd-nspawn or qemu itself. To build the image
without booting it, pass build instead of boot or qemu, or don’t pass
a verb at all.

By default, the disk image will have an appropriately sized root
partition and an ESP partition, but the partition layout and contents
can be fully customized using systemd-repart by creating partition
definition files in mkosi.repart/. This allows you to customize the
partitions as you see fit:

  • The root partition can be encrypted.
  • Partition sizes can be customized.
  • Partitions can be protected with signed dm-verity.
  • You can opt out of having a root partition and only have a /usr
    partition instead.
  • You can add various other partitions, e.g. an XBOOTLDR partition or a
    swap partition.
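
A partition definition dropped into mkosi.repart/ is a plain
systemd-repart definition file. As a minimal sketch (file names and
sizes are only examples), an ESP plus a root partition sized to fit
the OS tree could be defined like this:

# mkosi.repart/00-esp.conf
[Partition]
Type=esp
Format=vfat
SizeMinBytes=512M
SizeMaxBytes=512M

# mkosi.repart/10-root.conf
[Partition]
Type=root
Format=ext4
CopyFiles=/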

As part of building the image, we’ll run various tools such as
systemd-sysusers, systemd-firstboot, depmod, systemd-hwdb and
more to make sure the image is set up correctly.

Configuring mkosi image builds

Naturally, with extended use you don’t want to specify all settings on
the command line every time, so mkosi supports configuration files in
which the same settings that can be specified on the command line can
be written down.

For example, the command we used above can be written down in a
configuration file mkosi.conf:

[Distribution]
Distribution=arch

[Output]
Format=disk

[Content]
Packages=
        systemd
        udev
        linux

Like systemd, mkosi uses INI configuration files. We also support
dropins which can be placed in mkosi.conf.d. Configuration files can
also be conditionalized using the [Match] section. For example, to
only install a specific package on Arch Linux, you can write the
following to mkosi.conf.d/10-arch.conf:

[Match]
Distribution=arch

[Content]
Packages=pacman

Because not everything you need will be supported in mkosi, we support
running scripts at various points during the image build process where
all extra image customization can be done. For example, if it is found,
mkosi.postinst is called after packages have been installed. Scripts
are executed on the host system by default (in a sandbox), but can be
executed inside the image by suffixing the script with .chroot, so if
mkosi.postinst.chroot is found it will be executed inside the image.
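
As an illustration, a mkosi.postinst.chroot script could look like the
following minimal sketch (the service and file names are just
examples):

#!/bin/sh
set -e
# This runs inside the image, after packages have been installed.
systemctl enable systemd-networkd.service
echo "Built with mkosi" > /etc/motd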

To add extra files to the image, you can place them in mkosi.extra in
the source directory and they will be automatically copied into the
image after packages have been installed.

Bootable images

If the necessary packages are installed, mkosi will automatically
generate a UEFI/BIOS bootable image. As mkosi is a systemd project, it
will always build
UKIs
(Unified Kernel Images), except if the image is BIOS-only (since UKIs
cannot be used on BIOS). The initramfs is built like a regular image by
installing distribution packages and packaging them up in a CPIO archive
instead of a disk image. Specifically, we do not use dracut,
mkinitcpio or initramfs-tools to generate the initramfs from the
host system. ukify is used to assemble all the individual components
into a UKI.

If you don’t want mkosi to generate a bootable image, you can set
Bootable=no to explicitly disable this logic.

Using mkosi for development

The main requirement to use mkosi for development is that we can
build our source code against the image we’re building and install it
into that image. mkosi supports this via build scripts.
If a script named mkosi.build (or mkosi.build.chroot) is found,
we’ll execute it as part of the build. Any files put by the build script
into $DESTDIR will be installed into the image. Required build
dependencies can be installed using the BuildPackages= setting. These
packages are installed into an overlay which is put on top of the image
when running the build script, so they are available during the build
but don’t end up in the final image.
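
For instance, build dependencies for a meson-based project might be
declared roughly like this in mkosi.conf (the package names are
illustrative and distribution-specific):

[Content]
BuildPackages=
        meson
        gcc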

An example mkosi.build.chroot script for a project using meson could
look as follows:

#!/bin/sh
meson setup "$BUILDDIR" "$SRCDIR"
ninja -C "$BUILDDIR"
# $WITH_TESTS is set by mkosi to 0 or 1; use a POSIX test since
# ((…)) is a bashism and this script runs under /bin/sh.
if [ "$WITH_TESTS" = "1" ]; then
    meson test -C "$BUILDDIR"
fi
meson install -C "$BUILDDIR"

Now, every time the image is built, the build script will be executed
and the results will be installed into the image.

The $BUILDDIR environment variable points to a directory that can be
used as the build directory for build artifacts to allow for incremental
builds if the build system supports it.

Of course, downloading all packages from scratch every time and
re-installing them again every time the image is built is rather slow,
so mkosi supports two modes of caching to speed things up.

The first caching mode caches all downloaded packages so they don’t have
to be downloaded again on subsequent builds. Enabling this is as simple
as running mkdir mkosi.cache.

The second mode of caching caches the image after all packages have been
installed but before running the build script. On subsequent builds,
mkosi will copy the cache instead of reinstalling all packages from
scratch. This mode can be enabled using the Incremental= setting.
While there is some rudimentary cache invalidation, the cache can also
forcibly be rebuilt by specifying -ff on the command line instead of
-f.
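
Putting both together, a typical invocation might look roughly like
this (assuming the --incremental command line option of your mkosi
version corresponds to the Incremental= setting):

$ mkdir -p mkosi.cache        # cache downloaded packages across builds
$ mkosi --incremental -f qemu # also cache the image before the build script runs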

Note that when running on a btrfs filesystem, mkosi will automatically
use subvolumes for the cached images which can be snapshotted on
subsequent builds for even faster rebuilds. We’ll also use reflinks to
do copy-on-write copies where possible.

With this setup, by running mkosi -f qemu in the systemd repository,
it takes about 40 seconds to go from a source code change to a root
shell in a virtual machine running the latest systemd with your change
applied. This makes it very easy to test changes to systemd in a safe
environment without risk of breaking your host system.

Of course, while 40 seconds is not a very long time, it’s still more
than we’d like, especially if all we’re doing is modifying the kernel
command line. That’s why we have the KernelCommandLineExtra= option to
configure kernel command line options that are passed to the container
or virtual machine at runtime instead of being embedded into the image.
These extra kernel command line options are picked up when the image is
booted with qemu’s direct kernel boot (using -append), but also when
booting a disk image in UEFI mode (using SMBIOS). The same applies to
systemd credentials (using the Credentials= setting). These settings
allow configuring the image without having to rebuild it, which means
that you only have to run mkosi qemu or mkosi boot again afterwards
to apply the new settings.
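
As a sketch, such runtime settings could be written down in the
configuration roughly as follows (the [Host] section placement is an
assumption that may differ between mkosi versions, and the values are
only examples; firstboot.locale is a credential consumed by
systemd-firstboot):

[Host]
KernelCommandLineExtra=systemd.log_level=debug
Credentials=firstboot.locale=C.UTF-8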

Building images without root privileges and loop devices

By using newuidmap/newgidmap and systemd-repart, mkosi is able to
build images without needing root privileges. As long as proper subuid
and subgid mappings are set up for your user in /etc/subuid and
/etc/subgid, you can run mkosi as your regular user without having
to switch to root.
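
Both files use one line per user in the form “user:start:count”. A
typical pair of entries (the user name and range are just an example)
would be:

$ cat /etc/subuid /etc/subgid
daan:524288:65536
daan:524288:65536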

Note that as of the writing of this blog post this only applies to the
build and qemu verbs. Booting the image in a systemd-nspawn
container with mkosi boot still needs root privileges. We’re hoping to
fix this in a future systemd release.

Regardless of whether you’re running mkosi with root or without root,
almost every tool we execute is invoked in a sandbox to isolate as much
of the build process from the host as possible. For example, /etc and
/var from the host are not available in this sandbox, to avoid host
configuration inadvertently affecting the build.

Because systemd-repart can build disk images without loop devices,
mkosi can run from almost any environment, including containers. All
that’s needed is a UID range with 65536 UIDs available, either via
running as the root user or via /etc/subuid and newuidmap. In a
future systemd release, we’re hoping to provide an alternative to
newuidmap and /etc/subuid to allow running mkosi from all
containers, even those with only a single UID available.

Supporting older distributions

mkosi depends on very recent versions of various systemd tools (v254 or
newer). To support older distributions, we implemented so-called tools
trees. In short, mkosi can first build a tools image for you that
contains all required tools to build the actual image. This can be
enabled by adding ToolsTree=default to your mkosi configuration.
Building a tools image does not require a recent version of systemd.

In the systemd mkosi configuration, we automatically use a tools tree if
we detect your distribution does not have the minimum required systemd
version installed.

Configuring variants of the same image using profiles

Profiles can be defined in the mkosi.profiles/ directory. The profile
to use can be selected using the Profile= setting (or --profile=) on
the command line. A profile allows you to bundle various settings behind
a single recognizable name. Profiles can also be matched on if you want
to apply some settings only to a few profiles.

For example, you could have a bootable profile that sets
Bootable=yes, adds the linux and systemd-boot packages and
configures Format=disk to end up with a bootable disk image when
passing --profile bootable on the mkosi command line.
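
A sketch of such a profile, written to mkosi.profiles/bootable.conf,
might look as follows (exact section placement can vary between mkosi
versions, and package names are distribution-specific; on Arch Linux,
for example, systemd-boot is part of the systemd package):

[Output]
Format=disk

[Content]
Bootable=yes
Packages=
        linux
        systemd-boot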

Building system extension images

System extension images extend the base system with an overlay
containing additional files, dynamically at runtime.

To build system extensions with mkosi, we need a base image on top of
which we can build our extension.

To keep things manageable, we’ll make use of mkosi’s support for
building multiple images so that we can build our base image and system
extension in one go.

We start by creating a temporary directory containing a base
configuration file mkosi.conf with some shared settings:

[Output]
OutputDirectory=mkosi.output
CacheDirectory=mkosi.cache

Now let’s continue with the base image definition by writing the
following to mkosi.images/base/mkosi.conf:

[Output]
Format=directory

[Content]
CleanPackageMetadata=no
Packages=systemd
         udev

We use the directory output format here instead of the disk output
so that we can build our extension without needing root privileges.

Now that we have our base image, we can define a sysext that builds on
top of it by writing the following to mkosi.images/btrfs/mkosi.conf:

[Config]
Dependencies=base

[Output]
Format=sysext
Overlay=yes

[Content]
BaseTrees=%O/base
Packages=btrfs-progs

BaseTrees= points to our base image, and Overlay=yes instructs mkosi
to only package the files added on top of the base tree.

We can’t sign the extension image without a key. We can generate one
by running mkosi genkey which will generate files that are
automatically picked up when building the image.

Finally, you can build the base image and the extensions by running
mkosi -f. You’ll find btrfs.raw in mkosi.output which is the
extension image.

Various other interesting features

  • To sign any generated UKIs for secure boot, put your secure boot key
    and certificate in mkosi.key and mkosi.crt and enable the
    SecureBoot= setting. You can also run mkosi genkey to have mkosi
    generate a key and certificate itself. (A small configuration sketch
    follows after this list.)
  • The Ephemeral= setting can be enabled to boot the image in an
    ephemeral copy that is thrown away when the container or virtual
    machine exits.
  • ShimBootloader= and BiosBootloader= settings are available to
    configure shim and grub installation if needed.
  • mkosi can boot directory trees in a virtual machine using
    virtiofsd. This is very useful for quickly rebuilding an image and
    booting it, as the image does not have to be packed up as a disk
    image.
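
As a small sketch, a couple of these settings combined in mkosi.conf
might look as follows (the section placement shown here is an
assumption and may differ between mkosi versions):

[Validation]
SecureBoot=yes

[Host]
Ephemeral=yes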

There are many more features that we won’t go over in detail in this
blog post. Learn more about those by reading the
documentation.

Conclusion

I’ll finish with a bunch of links to more information about mkosi and
related tooling:

ASG! 2023 CfP Closes Soon

Post Syndicated from original https://0pointer.net/blog/asg-2023-cfp-closes-soon.html

The All Systems Go! 2023 Call for Participation Closes in Three Days!

The Call for Participation (CFP) for All Systems Go! 2023 will close
in three days, on the 7th of July! We’d like to invite you to submit
your proposals for consideration to the CFP submission site quickly!


All topics relevant to foundational open-source Linux technologies are
welcome. In particular, however, we are looking for proposals on the
topics listed below.

The CFP will close on July 7th, 2023. A response will be sent to all
submitters on or before July 14th, 2023. The conference takes place in
🗺️ Berlin, Germany 🇩🇪 on Sept. 13-14th.

All Systems Go! 2023 is all about foundational open-source Linux
technologies. We are primarily looking for deeply technical talks by
and for developers, engineers and other technical roles.

We focus on the userspace side of things, so while kernel topics are
welcome they must have clear, direct relevance to userspace. The
following is a non-comprehensive list of topics encouraged for 2023
submissions:

  • Image-Based Linux 🖼️
  • Secure and Measured Boot 📏
  • TPM-Based Local/Remote Attestation, Encryption, Authentication 🔑
  • Low-level container executors and infrastructure ⚙️
  • IoT, embedded and server Linux infrastructure
  • Reproducible builds 🔧
  • Package management, OS, container 📦, image delivery and updating
  • Building Linux devices and applications 🏗️
  • Low-level desktop 💻 technologies
  • Networking 🌐
  • System and service management 🚀
  • Tracing and performance measuring 🔍
  • IPC and RPC systems 🦜
  • Security 🔐 and Sandboxing 🏖️

For more information please visit our conference website!

Get kids creating webpages with HTML and CSS

Post Syndicated from Rik Cross original https://www.raspberrypi.org/blog/learning-html-and-css/

With our new free ‘Introduction to web development’ path, young people are able to learn HTML and create their own webpages on topics that matter to them. The path is made up of six projects that show children and teenagers how to structure pages using HTML, and style them using CSS. 

At Coolest Projects, a young person explores a coding project.

With all the website tools available today, why learn HTML? 

Webpage creation has come a long way since the 1990s, but HTML is still the markup language that is used to display almost every page on the World Wide Web. By knowing how it works, you can deepen your understanding of the technology you use every day.

If you want to build your own website today, there are many tools to get you quickly up and running. These tools often involve dragging and dropping predefined elements and choosing from a wide collection of themed looks. Learning HTML and CSS skills is important for web designers, developers, and content creators who want to build unique webpage designs that make their content stand out.

Six webpages, each with a unique design and based on a topic important to the creator.
The path helps young people express themselves through their own webpages

With our new ‘Introduction to web development’ path, we want creators (the young people who use our projects) to be able to quickly make fantastic-looking websites that follow modern best practices, while they also learn how HTML and CSS work together to create a webpage. Creators write their own HTML to develop the content and structure of their webpages. And they customise our pre-built CSS style sheets to get their webpages to look like they imagine.

This really is a fun and unique approach to learning HTML and building a webpage, and we think young people will quickly engage with it. They start by finding out how to structure pages using HTML before applying CSS styles that bring their pages to life. Through the six projects, they build all the skills and independence they need to make webpages that matter to them. 

Accessibility first

We believe that young people should find out about website accessibility right from the start of their learning journey. That’s why the path for learning HTML shows creators how they can make their websites accessible to all their users regardless of the users’ needs or digital devices.

To support this, our new path uses semantic HTML. Older HTML tutorials might show you how to structure a webpage using generic tags like <div> and <span>. In contrast, the meaning and purpose of semantic HTML tags is very clear. For example:

  • <main> is used to tag the main content for the webpage
  • <footer> is used for content to be displayed in the footer
  • <blockquote> contains a quote and typically the author of the quote
  • <section> contains a portion of content that usually sits within the main part of the webpage

Semantic HTML supports accessibility because it allows people who use a screen reader to more easily navigate a webpage and read it in a logical way. 

Another element of accessible design that the path introduces is the colour combinations used on webpages. It is really important that contrasting colours are used for the background and the text. High contrast makes the text more readable, which means the webpage is more suitable for visually impaired users. 

Good and bad examples of colour contrasting on webpages.
It’s very important to use contrasting colours on a webpage

The path also shows creators the importance of adding meaningful alternative text for images. Good alternative text helps visually impaired users, and users who have a very low bandwidth and therefore turn images off in their web browser. 

With the path, young people will learn how to design webpages that respond to the device of the user

Finally, our path for learning HTML introduces creators to the concept of responsive web design. Responsive design is helpful because websites can be viewed on thousands of different devices. Some people view pages on large, high-resolution monitors, and others view them on a mobile phone screen. We show learners how they can use HTML and CSS to make their pages responsive so they display in the way that works best for the specific screen on which a user is viewing them.

Key questions answered

Who is the ‘Intro to web development’ path for?

We have written the projects in this path with young people of around 9 to 17 years of age in mind.

HTML and CSS are text-based markup languages. This means a young person who wants to start learning HTML needs to be familiar with typing on a keyboard. It would also be helpful to have experience of using the copy and paste function, which is useful when changing the layout of a page or copying similar pieces of code. 

Young people attending a Dojo.

If a young person is unsure whether they have the right skills to get started with the path, they can first try out a short ‘Discover’ project. With this Discover project, young people can choose between the themes ‘space’, ‘sunsets’, ‘forests’, or ‘animals’ to see how they can create their first webpage in just five steps. (We’re still working on the ‘Discover’ project type, so if you have any feedback about it, let us know.)

An example step from the Discover project, forest theme.
Young people can experiment with our Discover project to build their own webpage in just a few steps

What will young people learn with the path?

Creators will learn how to use HTML and CSS to build webpages that have:

  • Images
  • Lists
  • Quotes 
  • Links 
  • Animations
  • Imported fonts

They will also learn about how to make their webpages accessible to all through use of:

  • Semantic HTML
  • Alternative text for images
  • Colour contrast checking
  • Responsive design (meaning the webpage adapts to the device on which it is viewed)

How long does the path take to complete?

We’ve designed the path so young people can complete it in six one-hour sessions, with one hour for each project. Since the project instructions encourage creators to upgrade their projects, they may wish to go further and spend a little more time getting their projects exactly as they imagine them. 

A CoderDojo coding session for young people.

What software is needed to create the projects in the path?

Young people only need a standard web browser to follow the project instructions and use an online code editor to create their webpages. 

What can young people do next?

Explore our other projects for learning HTML

There are 28 other step-by-step projects for creators to choose from on our website. They can browse through these to see what cool things they’d like to make and what new skills they want to learn.

Build a webpage for Coolest Projects 

If your kid is proud of the webpage they create with the final ‘Invent’ project in the path, they can share it with a worldwide community of young creators in our free Coolest Projects tech showcase. Project registration will open again in spring 2023. You can sign up to hear news about the showcase on the Coolest Projects homepage.

Two teenage girls participating in Coolest Projects show off their tech project.
Details about the projects in ‘Intro to web development’

The ‘Intro to web development’ path is structured according to our Digital Making Framework, with three Explore projects, two Design projects, and a final Invent project. You can also check out our learning graph to see the progression of young people’s skills and knowledge throughout the path.

Explore project 1: Anime expressions



In the ‘Anime expressions’ project, creators build and style a webpage for an anime drawing tutorial. They learn how to use HTML tags to structure a webpage; use CSS to apply layout, colours, and fonts; and add images and text content to their page.  

Explore project 2: Top 5 emojis



With the ‘Top 5 emojis’ project, young people create a webpage displaying their top 5 list of emojis. They learn how to add emojis, create a list, use a block quote, and animate elements of the page. 

Explore project 3: Flip treat webcards



With the ‘Flip treat webcards’ project, creators make a webpage showing a flip card with a treat from around the world. They use CSS to make the card flip over when a user interacts with it. Creators also learn how to apply gradients and import fonts from Google Fonts.

Design project 1: Mood board



This Design project gives creators the chance to develop the skills that they have learned in the three ‘Explore’ projects. With the ‘Mood board’ project, young people create a webpage to display a mood board for a real or imaginary project. The mood board could, for example, show ideas for a party, a fashion item, a redesign of their bedroom, or a website; or it could show reminders of all the things that make them happy. 

Design project 2: Sell me something

 




The ‘Sell me something’ project is another chance for creators to practise the skills that they have gained in the ‘Explore’ projects. They create a webpage to ‘sell something’ to the webpage’s visitors. It could be anything they like, from an object they love, to a game they like to play.

Invent project: Build a webpage

 




The ‘Build a webpage’ project is the final project in the path and allows young people to independently build a webpage on any topic they’re interested in. This Invent project offers info cards to remind creators of the key skills they’ve learned with the path, and a light structure to support them through the process of making their webpage. Young people are encouraged to showcase their final webpages in the path gallery to inspire other creators. 


Linux Boot Partitions

Post Syndicated from original https://0pointer.net/blog/linux-boot-partitions.html

💽 Linux Boot Partitions and How to Set Them Up 🚀

Let’s have a look at how traditional Linux distributions set up
/boot/ and the ESP, and how this could be improved.

How Linux distributions traditionally set up their “boot” file
systems has varied to some degree, but the most common choice has been
to have a separate partition mounted to
/boot/. Usually the partition is formatted as a Linux file system
such as ext2/ext3/ext4. The partition contains the kernel images, the
initrd and various boot loader resources. Some distributions, like
Debian and Ubuntu, also store ancillary files associated with the
kernel here, such as kconfig or System.map. Such a traditional
boot partition is only defined within the context of the distribution,
and typically not immediately recognizable as such when looking just
at the partition table (i.e. it uses the generic Linux partition type
UUID).

With the arrival of UEFI a new partition relevant for boot appeared,
the EFI System Partition (ESP). This partition is defined by the
firmware environment, but typically accessed by Linux to install or
update boot loaders. The choice of file system is not up to Linux, but
effectively mandated by the UEFI specification: vFAT. In theory the ESP
could be formatted with other file systems too, but that would require
the firmware to support them, which is rare and firmware specific, as
vFAT is the only file system the specification mandates. In other
words, vFAT is the only file system which is guaranteed to be
universally supported.

There’s a major overlap of the type of the data typically stored in
the ESP and in the traditional boot partition mentioned earlier: a
variety of boot loader resources as well as kernels/initrds.

Unlike the traditional boot partition, the ESP is easily recognizable
in the partition table via its GPT partition type UUID. The ESP is
also a shared resource: all OSes installed on the same disk will
share it and put their boot resources into it (as opposed to the
traditional boot partition, of which there is one per installed Linux
OS, and only that one will put resources there).

To summarize, the most common setup on typical Linux distributions is
something like this:

Type                   | Linux Mount Point | File System Choice
Linux “Boot” Partition | /boot/            | Any Linux file system, typically ext2/ext3/ext4
ESP                    | /boot/efi/        | vFAT

As mentioned, not all distributions or local installations agree on
this. For example, it’s probably worth mentioning that some
distributions decided to put kernels onto the root file system of the
OS itself. For this setup to work the boot loader itself [sic!] must
implement a non-trivial part of the storage stack. This may have to
include RAID, storage drivers, networked storage, volume management,
disk encryption, and Linux file systems. Leaving aside the conceptual
argument that complex storage stacks don’t belong in boot loaders
there are very practical problems with this approach. Reimplementing
the Linux storage stack in all its combinations is a massive amount of
work. It took decades to implement what we have on Linux now, and it
will take a similar amount of work to catch up in the boot loader’s
reimplementation. Moreover, there’s a political complication: some
Linux file system communities made clear they have no interest in
supporting a second file system implementation that is not maintained
as part of the Linux kernel.

What’s interesting is that the /boot/efi/ mount point is nested
below the /boot/ mount point. This effectively means that to access
the ESP the Boot partition must exist and be mounted first. A system
with just an ESP and without a Boot partition hence doesn’t fit well
into the current model. The Boot partition will also have to carry an
empty “efi” directory that can be used as the inner mount point, and
serves no other purpose.

Given that the traditional boot partition and the ESP may carry
similar data (i.e. boot loader resources, kernels, initrds) one may
wonder why they are separate concepts. Historically, this was the
easiest way to make the pre-UEFI way how Linux systems were booted
compatible with UEFI: conceptually, the ESP can be seen as just a
minor addition to the status quo ante that way. Today, primarily two
reasons remain:

  • Some distributions see a benefit in support for complex Linux file
    system concepts such as hardlinks, symlinks, SELinux labels/extended
    attributes and so on when storing boot loader resources. – I
    personally believe that making use of features in the boot file
    systems that the firmware environment cannot really make sense of is
    very clearly not advisable. The UEFI file system APIs know no
    symlinks, and what is SELinux to UEFI anyway? Moreover, putting more
    than the absolute minimum of simple data files into such file
    systems immediately raises questions about how to authenticate them
    comprehensively (including all fancy metadata) cryptographically on
    use (see below).

  • On real-life systems that ship with non-Linux OSes the ESP often
    comes pre-installed with a size too small to carry multiple Linux
    kernels and initrds. As growing the size of an existing ESP is
    problematic (for example, because there’s no space available
    immediately after the ESP, or because some low-quality firmware
    reacts badly to the ESP changing size) placing the kernel in a
    separate, secondary partition (i.e. the boot partition) circumvents
    these space issues.

File System Choices

We already mentioned that the ESP effectively has to be vFAT, as that
is what UEFI (more or less) guarantees. The file system choice for the
boot partition is not quite as restricted, but using arbitrary Linux
file systems is not really an option either. The file system must be
accessible by both the boot loader and the Linux OS. Hence only file
systems that are available in both can be used. Note that such
secondary implementations of Linux file systems in the boot
environment – limited as they may be – are not typically welcomed
or supported by the maintainers of the canonical file system
implementation in the upstream Linux kernel. Modern file systems are
notoriously complicated and delicate and simply don’t belong in boot
loaders.

In a trusted boot world, the two file systems for the ESP and the
/boot/ partition should be considered untrusted: any code or
essential data read from them must be authenticated cryptographically
before use. And even more, the file system structures themselves are
also untrusted. The file system driver reading them must be careful
not to be exploitable by a rogue file system image. Effectively this
means a simple file system (for which a driver can be more easily
validated and reviewed) is generally a better choice than a complex
file system (Linux file system communities made it pretty clear that
robustness against rogue file system images is outside of their scope
and not what is being tested for).

Some approaches tried to address the fact that boot partitions are
untrusted territory by encrypting them via a mechanism compatible with
LUKS, and adding decryption capabilities to the boot loader so it can
access them. This misses the point though, as encryption does not imply
authentication, and only authentication is typically desired. The boot
loader and kernel code are typically Open Source anyway, and hence
there’s little value in attempting to keep secret what is already
public knowledge. Moreover, encryption implies the existence of an
encryption key. Physically typing in the decryption key on a keyboard
might still be acceptable on desktop systems with a single human user
in front, but outside of that scenario unlock via TPM, PKCS#11 or
network services are typically required. And even on the desktop FIDO2
unlocking is probably the future. Implementing all the technologies
these unlocking mechanisms require in the boot loader is not
realistic, unless the boot loader is to become a full OS of its own, as
it would require subsystems for FIDO2, PKCS#11, USB, Bluetooth,
networking, smart card access, and so on.

File System Access Patterns

Note that traditionally both mentioned partitions were read-only
during most parts of the boot. Only later, once the OS is up, write
access was required to implement OS or boot loader updates. In today’s
world things have become a bit more complicated. A modern OS might
want to require some limited write access already in the boot loader,
to implement boot counting/boot assessment/automatic fallback (e.g.,
if the same kernel fails to boot 3 times, automatically revert to
an older kernel), or to maintain an early storage-based random seed. This
means that even though the file system is mostly read-only, we need
limited write access after all.

vFAT cannot compete with modern Linux file systems such as btrfs
when it comes to data safety guarantees. It’s not a journaled file
system, does not use CoW or any form of checksumming. This means when
used for the system boot process we need to be particularly careful
when accessing it, and in particular when making changes to it (i.e.,
trying to keep changes local to single sectors). It is essential to
use write patterns that minimize the chance of file system
corruption. Checking the file system (“fsck”) before modification
(and probably also reading) is important, as is ensuring the file
system is put into a “clean” state as quickly as possible after each
modification.

Code quality of the firmware in typical systems is known to not always
be great. When relying on the file system driver included in the
firmware it’s hence a good idea to limit use to operations that have a
better chance to be correctly implemented. For example, when writing
from the UEFI environment it might be wise to avoid any operation that
requires allocation algorithms, but instead focus on access patterns
that only override already written data, and do not require allocation
of new space for the data.

Besides write access from the boot loader code (as described above)
these file systems will require write access from the OS, to
facilitate boot loader and kernel/initrd updates. These types of
accesses are generally not fully random accesses (i.e., never partial
file updates) but usually mean adding new files as whole, and removing
old files as a whole. Existing files are typically not modified once
created, though they might be replaced wholly by newer versions.

Boot Loader Updates

Note that the update cycle frequencies for boot loaders and for
kernels/initrds are probably similar these days. While kernels are
still vastly more complex than boot loaders, security issues are
regularly found in both. In particular, boot loaders (through
“shim” and similar components) carry certificate/keyring and denylist
information, which typically requires frequent updates. Update cycles
hence have to be expected regularly.

Boot Partition Discovery

The traditional boot partition was not recognizable by looking just at
the partition table. On MBR systems it was directly referenced from
the boot sector of the disk, and on EFI systems from information
stored in the ESP. This is less than ideal since by losing this
entrypoint information the system becomes unbootable. It’s typically a
better, more robust idea to make boot partitions recognizable as such
in the partition table directly. This is done for the ESP via the GPT
partition type UUID. For traditional boot partitions this was not done
though.

Current Situation Summary

Let’s try to summarize the above:

  • Currently, typical deployments use two distinct boot partitions,
    often using two distinct file system implementations

  • Firmware effectively dictates existence of the ESP, and the use of
    vFAT

  • In the userspace view, the ESP mount is nested below the general
    boot partition mount

  • Resources stored in both partitions are primarily kernel/initrd, and
    boot loader resources

  • The mandatory use of vFAT brings certain data safety challenges,
    as does quality of firmware file system driver code

  • During boot limited write access is needed, during OS runtime
    more comprehensive write access is needed (though still not fully
    random).

  • Less restricted but still limited write patterns from OS
    environment
    (only full file additions/updates/removals, during
    OS/boot loader updates)

  • Boot loaders should not implement complex storage stacks.

  • ESP can be auto-discovered from the partition table, traditional
    boot partition cannot.

  • ESP and the traditional boot partition are not protected
    cryptographically neither in structure nor contents. It is expected
    that loaded files are individually authenticated after being read.

  • The ESP is a shared resource — the traditional boot partition a
    resource specific to each installed Linux OS on the same disk.

How to Do it Better

Now that we have discussed many of the issues with the status quo ante, let’s see how we can do things better:

  • Two partitions for essentially the same data is a bad idea. Given
    they carry data very similar or identical in nature, the common case
    should be to have only one (but see below).

  • Two file system implementations are worse than one. Given that vFAT
    is more or less mandated by UEFI and the only format universally
    understood by all players, and thus has to be used anyway, it might
    as well be the only file system that is used.

  • Data safety is unnecessarily bad so far: both ESP and boot partition
    are continuously mounted from the OS, even though access is pretty
    restricted: outside of update cycles access is typically not
    required.

  • All partitions should be auto-discoverable/self-descriptive

  • The two partitions should not be exposed as nested mounts to userspace

To be more specific, here’s how I think a better way to set this all up would look:

  • Whenever possible, only have one boot partition, not two. On EFI
    systems, make it the ESP. On non-EFI systems use an XBOOTLDR
    partition instead (see below). Only have both in the case where a
    Linux OS is installed on a system that already contains an OS with
    an ESP that is too small to carry sufficient kernels/initrds. When a
    system contains a XBOOTLDR partition put kernels/initrd on that,
    otherwise the ESP.

  • Instead of the vaguely defined, traditional Linux “boot” partition
    use the XBOOTLDR partition type as defined by the Discoverable
    Partitions
    Specification
    . This
    ensures the partition is discoverable, and can be automatically
    mounted by things like
    systemd-gpt-auto-generator. Use
    XBOOTLDR only if you have to, i.e., when dealing with systems that
    lack UEFI (and where the ESP hence has no value) or to address the
    mentioned size issues with the ESP. Note that unlike the traditional
    boot partition the XBOOTLDR partition is a shared resource, i.e.,
    shared between multiple parallel Linux OS installations on the same
    disk. Because of this it is typically wise to place a per-OS
    directory at the top of the XBOOTLDR file system to avoid conflicts.

  • Use vFAT for both partitions, it’s the only thing
    universally understood among relevant firmwares and Linux. It’s
    simple enough to be useful for untrusted storage. Or to say this
    differently: writing a file system driver that is not easily
    vulnerable to rogue disk images is much easier for vFAT than for
    let’s say btrfs. – But the choice of vFAT implies some care needs to
    be taken to address the data safety issues it brings, see below.

  • Mount the two partitions via “automount” logic. For example, via
    systemd’s automount units, with a very short idle time-out (one
    second or so). This improves data safety immensely, as the file
    systems will remain mounted (and thus possibly in a “dirty” state)
    only for very short periods of time, when they are actually accessed
    – and all that while the fact that they are not mounted continuously
    is mostly not noticeable for applications, as the file system paths
    remain continuously around. Given that the backing file system
    (vFAT) has poor data safety properties, it is essential to shorten
    the periods of unclean file system state as much as possible. In
    fact, this is what the aforementioned systemd-gpt-auto-generator
    logic actually does by default. (A sketch of such a unit pair is
    shown after this list.)

  • Whenever mounting one of the two partitions, do a file system check
    (fsck; in fact this is also what
    systemd-gpt-auto-generator does by default, hooked into
    the automount logic, to run on first access). This ensures that even
    if the file system is in an unclean state it is restored to be clean
    when needed, i.e., on first access.

  • Do not mount the two partitions nested, i.e., no
    more /boot/efi/. First of all, as mentioned above, it
    should be possible (and is desirable) to only have one of the
    two. Hence it is simply a bad idea to require the other as well,
    just to be able to mount it. More importantly though, by nesting
    them, automounting is complicated, as it is necessary to trigger the
    first automount to establish the second automount, which defeats the
    point of automounting them in the first place. Use the two distinct
    mount points /efi/ (for the ESP) and
    /boot/ (for XBOOTLDR) instead. You might have guessed,
    but that too is what systemd-gpt-auto-generator does by
    default.

  • When making additions or updates to ESP/XBOOTLDR from the OS make
    sure to create a file and write it in full, then
    syncfs() the whole file system, then rename to give it
    its final name, and syncfs() again. Do the same when
    removing files. (A shell sketch of this pattern follows after this
    list.)

  • When writing from the boot loader environment/UEFI to ESP/XBOOTLDR,
    do not append to files or create new files. Instead overwrite
    already allocated file contents (for example to maintain a random
    seed file) or rename already allocated files to include information
    in the file name (and ideally do not increase the file name in
    length; for example to maintain boot counters).

  • Consider adopting
    UKIs,
    which minimize the number of files that need to be updated on the
    ESP/XBOOTLDR during OS/kernel updates (ideally down to 1)

  • Consider adopting
    systemd-boot,
    which minimizes the number of files that need to be updated on boot
    loader updates (ideally down to 1)

  • Consider removing any mention of ESP/XBOOTLDR from
    /etc/fstab, and just let
    systemd-gpt-auto-generator do its thing.

  • Stop implementing file systems, complex storage, disk encryption, …
    in your boot loader.
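
To illustrate the automount approach mentioned above, here is a rough
sketch of a hand-written mount/automount unit pair for the ESP. In
practice systemd-gpt-auto-generator creates equivalent units for you
automatically; this is only meant to show the mechanism, and the
partition reference is a placeholder:

# /etc/systemd/system/efi.mount
[Unit]
Description=EFI System Partition

[Mount]
What=/dev/disk/by-partuuid/<esp-partition-uuid>
Where=/efi
Type=vfat
Options=umask=0077

# /etc/systemd/system/efi.automount
[Unit]
Description=EFI System Partition Automount

[Automount]
Where=/efi
TimeoutIdleSec=1s

[Install]
WantedBy=local-fs.target

Enabling efi.automount then mounts the file system on first access and
unmounts it again after one second of idleness.

And here is a shell sketch of the write pattern described above for
OS-side updates (paths are placeholders; sync -f flushes the file
system containing the given path, similar to what syncfs() does):

#!/bin/sh
set -e
# Write the new file in full under a temporary name ...
cp /path/to/new-kernel.efi /efi/EFI/Linux/new-kernel.efi.tmp
# ... flush the whole vFAT file system ...
sync -f /efi
# ... then rename it to its final name, and flush again.
mv /efi/EFI/Linux/new-kernel.efi.tmp /efi/EFI/Linux/new-kernel.efi
sync -f /efi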

By implementing things like this you gain:

  • Simplicity: only one file system implementation, typically only
    one partition and mount point

  • Robust auto-discovery of all partitions, no need to even
    configure /etc/fstab

  • Data safety guarantees as good as possible, given the
    circumstances

To summarize this in a table:

Type     | Linux Mount Point | File System Choice | Automount
ESP      | /efi/             | vFAT               | yes
XBOOTLDR | /boot/            | vFAT               | yes

A note regarding modern boot loaders that implement the Boot Loader
Specification:
both partitions are explicitly listed in the specification as sources
for both Type #1 and Type #2 boot menu entries. Hence, if you use such
a modern boot loader (e.g. systemd-boot) these two partitions are the
preferred location for boot loader resources, kernels and initrds
anyway.

Addendum: You got RAID?

You might wonder, what about RAID setups and the ESP? This comes up
regularly in discussions: how to set up the ESP so that (software)
RAID1 (mirroring) can be done on the ESP. Long story short: I’d
strongly advise against using RAID on the ESP. Firmware typically
doesn’t have native RAID support, and given that firmware and boot
loader can write to the file systems involved, any attempt to use
software RAID on them will mean that a boot cycle might corrupt the
RAID sync, and immediately requires a re-synchronization after
boot. If RAID1 backing for the ESP is really necessary, the only way
to implement that safely would be to implement this as a driver for
UEFI – but that creates certain bootstrapping issues (i.e., where to
place the driver if not on the ESP, the very file system the driver is
supposed to provide access to), and also reimplements a considerable
component of the
OS storage stack in firmware mode, which seems problematic.

So what to do instead? My recommendation would be to solve this via
userspace tooling. If redundant disk support shall be implemented for
the ESP, then create separate ESPs on all disks, and synchronize them
on the file system level instead of the block level. Or in other
words, the tools that install/update/manage kernels or boot loaders
should be taught to maintain multiple ESPs instead of one. Copy the
kernels/boot loader files to all of them, and remove them from all of
them. Under the assumption that the goal of RAID is a more reliable
system this should be the best way to achieve that, as it doesn’t
pretend the firmware could do things it actually cannot do. Moreover
it minimizes the complexity of the boot loader, shifting the syncing
logic to userspace, where it’s typically easier to get right.

Addendum: Networked Boot

The discussion above focuses on booting up from a local disk. When
thinking about networked boot I think two scenarios are particularly
relevant:

  1. PXE-style network booting. I think in this mode of operation focus
    should be on directly booting a single UKI image instead of a boot
    loader. This sidesteps the whole issue of maintaining any boot
    partition at all, and simplifies the boot process greatly. In
    scenarios where this is not sufficient, and an interactive boot
    menu or other boot loader features are desired, it might be a good
    idea to take inspiration from the UKI concept, and build a single
    boot loader EFI binary (such as systemd-boot), and include the UKIs
    for the boot menu items and other resources inside it via PE
    sections. Or in other words, build a single boot loader binary that
    is “supercharged” and contains all auxiliary resources in its own
    PE sections. (Note: this does not exist, it’s an idea I intend to
    explore with systemd-boot). Benefit: a single file has to be
    downloaded via PXE/TFTP, not more. Disadvantage: unused resources
    are downloaded unnecessarily. Either way: in this context there is
    no local storage, and the ESP/XBOOTLDR discussion above is without
    relevance.

  2. Initrd-style network booting. In this scenario the boot loader and
    kernel/initrd (better: UKI) are available on a local disk. The
    initrd then configures the network and transitions to a network
    share or file system on a network block device for the root file
    system. In this case the discussion above applies, and in fact the
    ESP or XBOOTLDR partition would be the only partition available
    locally on disk.

And this is all I have for today.

Brave New Trusted Boot World

Post Syndicated from original https://0pointer.net/blog/brave-new-trusted-boot-world.html

🔐 Brave New Trusted Boot World 🚀

This document looks at the boot process of general purpose Linux
distributions. It covers the status quo and how we envision Linux boot
to work in the future with a focus on robustness and simplicity.

This document will assume that the reader has comprehensive
familiarity with TPM 2.0 security chips and their capabilities (e.g.,
PCRs, measurements, SRK), boot loaders, the shim binary, Linux,
initrds, UEFI Firmware, PE binaries, and SecureBoot.

Problem Description

Status quo ante of the boot logic on typical Linux distributions:

  • Most popular Linux distributions generate initrds locally, and
    they are unsigned, thus not protected through SecureBoot (since that
    would require local SecureBoot key enrollment, which is generally
    not done), nor TPM PCRs.

  • Boot chain is typically Firmware → shim → grub → Linux kernel →
    initrd (dracut or similar) → root file system

  • Firmware’s UEFI SecureBoot protects shim, shim’s key management
    protects grub and kernel. No code signing protects initrd. initrd
    acquires the key for encrypted root fs from the user (or
    TPM/FIDO2/PKCS11).

  • shim/grub/kernel is measured into TPM PCR 4, among other stuff

  • EFI TPM event log reports measured data into TPM PCRs, and can be
    used to reconstruct and validate state of TPM PCRs from the used
    resources.

  • No userspace components are typically measured, except for what IMA
    measures

  • New kernels require locally generating new boot loader scripts and
    generating a new initrd each time. OS updates thus mean fragile
    generation of multiple resources and copying multiple files into the
    boot partition.

Problems with the status quo ante:

  • initrd typically unlocks root file system encryption, but is not
    protected whatsoever, and trivial to attack and modify offline

  • OS updates are brittle: PCR values of grub are very hard to
    pre-calculate, as grub measures chosen control flow path, not just
    code images. PCR values vary wildly, and OS provided resources are
    not measured into separate PCRs. Grub’s PCR measurements might be
    useful up to a point to reason about the boot after the fact, for
    the most basic remote attestation purposes, but useless for
    calculating them ahead of time during the OS build process (which
    would be desirable to be able to bind secrets to future expected PCR
    state, for example to bind secrets to an OS in a way that they remain
    accessible even after that OS is updated).

  • Updates of a boot loader are not robust, require multi-file updates
    of ESP and boot partition, and regeneration of boot scripts

  • No rollback protection (no way to cryptographically invalidate
    access to TPM-bound secrets on OS updates)

  • Remote attestation of running software is needlessly complex since
    initrds are generated locally and thus basically are guaranteed to
    vary on each system.

  • Locking resources maintained by arbitrary user apps to TPM state
    (PCRs) is not realistic for general purpose systems, since PCRs will
    change on every OS update, and there’s no mechanism to re-enroll
    each such resource before every OS update, and remove the old
    enrollment after the update.

  • There is no concept to cryptographically invalidate/revoke secrets
    for an older OS version once updated to a new OS version. An
    attacker thus can always access the secrets generated on old OSes if
    they manage to exploit an old version of the OS — even if a newer
    version already has been deployed.

Goals of the new design:

  • Provide a fully signed execution path from firmware to
    userspace, no exceptions

  • Provide a fully measured execution path from firmware to
    userspace, no exceptions

  • Separate out TPM PCRs assignments, by “owner” of measured
    resources, so that resources can be bound to them in a fine-grained
    fashion.

  • Allow easy pre-calculation of expected PCR values based on
    booted kernel/initrd, configuration, local identity of the system

  • Rollback protection

  • Simple & robust updates: one updated file per concept

  • Updates without requiring re-enrollment/local preparation of the
    TPM-protected resources (no more “brittle” PCR hashes that must be
    propagated into every TPM-protected resource on each OS update)

  • System ready for easy remote attestation, to prove validity of
    booted OS, configuration and local identity

  • Ability to bind secrets to specific phases of the boot, e.g. the
    root fs encryption key should be retrievable from the TPM only in
    the initrd, but not after the host transitioned into the root fs.

  • Reasonably secure, automatic, unattended unlocking of disk
    encryption secrets should be possible.

  • “Democratize” use of PCR policies by defining PCR register meanings,
    and making binding to them robust against updates, so that
    external projects can safely and securely bind their own data to
    them (or use them for remote attestation) without risking breakage
    whenever the OS is updated.

  • Build around TPM 2.0 (with graceful fallback for TPM-less
    systems if desired, but TPM 1.2 support is out of scope)

Considered attack scenarios and considerations:

  • Evil Maid: neither online nor offline (i.e. “at rest”), physical
    access to a storage device should enable an attacker to read the
    user’s plaintext data on disk (confidentiality); neither online nor
    offline, physical access to a storage device should allow undetected
    modification/backdooring of user data or OS (integrity), or
    exfiltration of secrets.

  • TPMs are assumed to be reasonably “secure”, i.e. can securely
    store/encrypt secrets. Communication to TPM is not “secure” though
    and must be protected on the wire.

  • Similarly, the CPU is assumed to be reasonably “secure”

  • SecureBoot is assumed to be reasonably “secure” to permit validated
    boot up to and including shim+boot loader+kernel (but see discussion
    below)

  • All user data must be encrypted and authenticated. All vendor and
    administrator data must be authenticated.

  • It is assumed all software involved regularly contains
    vulnerabilities and requires frequent updates to address them, plus
    regular revocation of old versions.

  • It is further assumed that key material used for signing code by the
    OS vendor can reasonably be kept secure (via use of HSM, and
    similar, where secret key information never leaves the signing
    hardware) and does not require frequent roll-over.

Proposed Construction

Central to the proposed design is the concept of a Unified Kernel
Image (UKI)
. These UKIs are the combination of a Linux kernel image,
an initrd, a UEFI boot stub program (and further resources, see
below) into one single UEFI PE file that can either be directly
invoked by the UEFI firmware (which is useful in particular in some
cloud/Confidential Computing environments) or through a boot loader
(which is generally useful to implement support for multiple kernel
versions, with interactive or automatic selection of image to boot
into, potentially with automatic fallback management to increase
robustness).

UKI Components

Specifically, UKIs typically consist of the following resources:

  1. An UEFI boot stub that is a small piece of code still running in
    UEFI mode and that transitions into the Linux kernel included in
    the UKI (e.g., as implemented in
    sd-stub,
    see below)

  2. The Linux kernel to boot in the .linux PE section

  3. The initrd that the kernel shall unpack and invoke in the
    .initrd PE section

  4. A kernel command line string, in the .cmdline PE
    section

  5. Optionally, information describing the OS this kernel is intended
    for, in the .osrel PE section (derived from
    /etc/os-release of the booted OS). This is useful for
    presentation of the UKI in the boot loader menu, and ordering it
    against other entries, using the included version information.

  6. Optionally, information describing kernel release information
    (i.e. uname -r output) in the .uname PE
    section. This is also useful for presentation of the UKI in the
    boot loader menu, and ordering it against other entries.

  7. Optionally, a boot splash to bring to screen before transitioning
    into the Linux kernel in the .splash PE section

  8. Optionally, a compiled Devicetree database file, for systems which
    need it, in the .dtb PE section

  9. Optionally, the public key in PEM format that matches the
    signatures of the .pcrsig PE section (see below), in a
    .pcrpkey PE section.

  10. Optionally, a JSON file encoding expected PCR 11 hash values seen
    from userspace once the UKI has booted up, along with signatures
    of these expected PCR 11 hash values, matching a specific public
    key in the .pcrsig PE section. (Note: we use plural
    for “values” and “signatures” here, as this JSON file will
    typically carry a separate value and signature for each PCR bank
    for PCR 11, i.e. one pair of value and signature for the SHA1
    bank, and another pair for the SHA256 bank, and so on. This
    ensures when enrolling or unlocking a TPM-bound secret we’ll
    always have a signature around matching the banks available
    locally (after all, which banks the local hardware supports is up
    to the hardware). For the sake of simplifying this already overly
    complex topic, we’ll pretend in the rest of the text there was
    only one PCR signature per UKI we have to care about, even if this
    is not actually the case.)
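
Since all of these resources live in named PE sections, they can be
listed with standard PE tooling. As a quick illustration (the UKI file
name is just an example), something like

$ objdump -h foo.efi

should show the .linux, .initrd, .cmdline, .osrel and related sections
next to the stub’s own code and data sections.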

Given UKIs are regular UEFI PE files, they can thus be signed as one
for SecureBoot, protecting all of the individual resources listed
above at once, and their combination. Standard Linux tools such as
sbsigntool and pesign can be used to sign
UKI files.
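
For example, a SecureBoot signature on a UKI with sbsign might look
roughly like this (key, certificate and file names are illustrative):

$ sbsign --key db.key --cert db.crt --output foo.signed.efi foo.efi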

UKIs wrap all of the above data in a single file, hence all of the
above components can be updated in one go through single file atomic
updates, which is useful given that the primary expected storage place
for these UKIs is the UEFI System Partition (ESP), which is a vFAT
file system, with its limited data safety guarantees.

UKIs can be generated via a single, relatively simple objcopy
invocation that glues the listed components together, generating one
PE binary that then can be signed for SecureBoot. (For details on
building these, see below.)

Note that the primary location to place UKIs in is the EFI System
Partition (or an otherwise firmware accessible file system). This
typically means a VFAT file system of some form. Hence an effective
UKI size limit of 4GiB is in place, as that’s the largest file size a
FAT32 file system supports.

Basic UEFI Stub Execution Flow

The mentioned UEFI stub program will execute the following operations
in UEFI mode before transitioning into the Linux kernel that is
included in its .linux PE section:

  1. The PE sections listed are searched for in the invoked UKI the stub
    is part of, and superficially validated (i.e. general file format is
    in order).

  2. All PE sections listed above of the invoked UKI are measured into
    TPM PCR 11. This TPM PCR is expected to be all zeroes before the UKI
    initializes. Pre-calculation is thus very straight-forward if the
    resources included in the PE image are known. (Note: as a single
    exception the .pcrsig PE section is excluded from this measurement,
    as it is supposed to carry the expected result of the measurement, and
    thus cannot also be input to it, see below for further details about
    this section.)

  3. If the .splash PE section is included in the UKI it is brought onto the screen

  4. If the .dtb PE section is included in the UKI it is activated
    using the Devicetree UEFI “fix-up” protocol

  5. If a command line was passed from the boot loader to the UKI
    executable it is discarded if SecureBoot is enabled, and the
    command line from the .cmdline section is used instead. If
    SecureBoot is disabled and a command line was passed it is used in
    place of the one from .cmdline. Either way the used command line
    is measured into TPM
    PCR 12. (This of course removes any flexibility of control of the
    kernel command line of the local user. In many scenarios this is
    probably considered beneficial, but in others it is not, and some
    flexibility might be desired. Thus, this concept probably needs to
    be extended sooner or later, to allow more flexible kernel command
    line policies to be enforced via definitions embedded into the
    UKI. For example: allowing definition of multiple kernel command
    lines the user/boot menu can select one from; allowing additional
    allowlisted parameters to be specified; or even optionally allowing
    any verification of the kernel command line to be turned off even
    in SecureBoot mode. It would then be up to the builder of the UKI
    to decide on the policy of the kernel command line.)

  6. It will set a couple of volatile EFI variables to inform userspace
    about executed TPM PCR measurements (and which PCR registers were
    used), and other execution properties. (For example: the EFI
    variable StubPcrKernelImage in the
    4a67b082-0a4c-41cf-b6c7-440b29bb8c4f vendor namespace indicates
    the PCR register used for the UKI measurement, i.e. the value
    “11”.) A brief example of reading this variable follows right
    after this list.

  7. An initrd cpio archive is dynamically synthesized from the
    .pcrsig and .pcrpkey PE section data (this is later passed to
    the invoked Linux kernel as additional initrd, to be overlaid with
    the main initrd from the .initrd section). These files are later
    available in the /.extra/ directory in the initrd context.

  8. The Linux kernel from the .linux PE section is invoked with a
    combined initrd that is composed from the blob from the .initrd PE
    section, the dynamically generated initrd containing the
    .pcrsig and .pcrpkey PE sections, and possibly some additional
    components like sysexts or syscfgs.
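
Regarding step 6 above: on a booted system such an EFI variable can be
inspected through efivarfs, roughly like this (note that the first
four bytes efivarfs exposes are the variable’s attribute flags, not
its payload):

$ od -An -c /sys/firmware/efi/efivars/StubPcrKernelImage-4a67b082-0a4c-41cf-b6c7-440b29bb8c4f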

TPM PCR Assignments

In the construction above we take possession of two PCR registers
previously unused on generic Linux distributions:

  • TPM PCR 11 shall contain measurements of all components of the
    UKI (with exception of the .pcrsig PE section, see above). This
    PCR will also contain measurements of the boot phase once userspace
    takes over (see below).

  • TPM PCR 12 shall contain measurements of the used kernel command
    line. (Plus potentially other forms of
    parameterization/configuration passed into the UKI, not discussed in
    this document)

On top of that we intend to define two more PCR registers like this:

  • TPM PCR 15 shall contain measurements of the volume encryption
    key of the root file system of the OS.

  • [TPM PCR 13 shall contain measurements of additional extension
    images for the initrd, to enable a modularized initrd – not covered
    by this document]

(See the Linux TPM PCR Registry for an overview of how these four
PCRs fit into the list of Linux PCR assignments.)

For all four PCRs the assumption is that they are zero before the UKI
initializes, and only the data that the UKI and the OS measure into
them is included. This makes pre-calculating them straightforward:
given a specific set of UKI components, it is immediately clear what
PCR values can be expected in PCR 11 once the UKI booted up. Given a
kernel command line (and other parameterization/configuration) it is
clear what PCR values are expected in PCR 12.

Note that these four PCRs are defined by the conceptual “owner” of the
resources measured into them. PCR 11 only contains resources the OS
vendor
controls. Thus it is straight-forward for the OS vendor to
pre-calculate and then cryptographically sign the expected values for
PCR 11. The PCR 11 values will be identical on all systems that run
the same version of the UKI. PCR 12 only contains resources the
administrator controls, thus the administrator can pre-calculate
PCR values, and they will be correct on all instances of the OS that
use the same parameters/configuration. PCR 15 only contains resources
inherently local to the local system, i.e. the cryptographic key
material that encrypts the root file system of the OS.

Separating out these three roles does not imply these actually need to
be separate when used. However the assumption is that in many popular
environments these three roles should be separate.

By separating out these PCRs by the owner’s role, it becomes
straightforward to remotely attest, individually, on the software that
runs on a node (PCR 11), the configuration it uses (PCR 12) or the
identity of the system (PCR 15). Moreover, it becomes straightforward
to robustly and securely encrypt data so that it can only be unlocked
on a specific set of systems that share the same OS, or the same
configuration, or have a specific identity – or a combination thereof.

Note that the mentioned PCRs are so far not typically used on generic
Linux-based operating systems, to our knowledge. Windows uses them,
but given that Windows and Linux should typically not be included in
the same boot process this should be unproblematic, as Windows’ use of
these PCRs should thus not conflict with ours.

To summarize:

  • PCR 11: Measurement of UKI components and boot phases. Owner: OS
    vendor. Expected value before UKI boot: zero. Pre-calculable: yes
    (at UKI build time).

  • PCR 12: Measurement of the kernel command line and additional
    kernel runtime configuration such as systemd credentials and
    systemd syscfg images. Owner: administrator. Expected value before
    UKI boot: zero. Pre-calculable: yes (when the system configuration
    is assembled).

  • PCR 13: Measurement of system extension images for the initrd (and
    possibly more). Owner: (administrator). Expected value before UKI
    boot: zero. Pre-calculable: yes.

  • PCR 15: Measurement of the root file system volume key (possibly
    later more: measurement of root file system UUIDs and labels and of
    the machine ID /etc/machine-id). Owner: local system. Expected
    value before UKI boot: zero. Pre-calculable: yes (after first boot,
    once all such IDs are determined).
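
For illustration, on a running system the current values of these PCRs
can be read with the tpm2-tools, assuming a SHA256 bank is available:

$ tpm2_pcrread sha256:11,12,13,15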

Signature Keys

In the model above in particular two sets of private/public key pairs
are relevant:

  • The SecureBoot key to sign the UKI PE executable with. This controls
    permissible choices of OS/kernel

  • The key to sign the expected PCR 11 values with. Signatures made
    with this key will end up in the .pcrsig PE section. The public
    key part will end up in the .pcrpkey PE section.

Typically the key pair for the PCR 11 signatures should be chosen with
a narrow focus, reused for exactly one specific OS (e.g. “Fedora
Desktop Edition”) and the series of UKIs that belong to it (all the
way through all the versions of the OS). The SecureBoot signature key
can be used with a broader focus, if desired. By keeping the PCR 11
signature key narrow in focus one can ensure that secrets bound to the
signature key can only be unlocked on the narrow set of UKIs desired.

TPM Policy Use

Depending on the intended access policy to a resource protected by the
TPM, one or more of the PCRs described above should be selected to
bind TPM policy to.

For example, the root file system encryption key should likely be
bound to TPM PCR 11, so that it can only be unlocked if a specific set
of UKIs is booted (it should then, once acquired, be measured into PCR
15, as discussed above, so that later TPM objects can be bound to it,
further down the chain). With the model described above this is
reasonably straight-forward to do:

  • When userspace wants to bind disk encryption to a specific series of
    UKIs (“enrollment”), it looks for the public key passed to the
    initrd in the /.extra/ directory (which as discussed above
    originates in the .pcrpkey PE section of the UKI). The relevant
    userspace component (e.g. systemd) is then responsible for
    generating a random key to be used as symmetric encryption key for
    the storage volume (let’s call it the disk encryption key here,
    DEK). The TPM is then used to encrypt (“seal”) the DEK with its
    internal Storage Root Key (TPM SRK). A TPM2 policy is bound to the
    encrypted DEK. The policy enforces that the DEK may only be
    decrypted if a valid signature is provided that matches the state of
    PCR 11 and the public key provided in the /.extra/ directory of
    the initrd. The plaintext DEK key is passed to the kernel to
    implement disk encryption (e.g. LUKS/dm-crypt). (Alternatively,
    hardware disk encryption can be used too, i.e. Intel MKTME, AMD SME
    or even OPAL, all of which are outside of the scope of this
    document.) The TPM-encrypted version of the DEK which the TPM
    returned is written to the encrypted volume’s superblock.

  • When userspace wants to unlock disk encryption on a specific
    UKI, it looks for the signature data passed to the initrd in the
    /.extra/ directory (which as discussed above originates in the
    .pcrsig PE section of the UKI). It then reads the encrypted
    version of the DEK from the superblock of the encrypted volume. The
    signature and the encrypted DEK are then passed to the TPM. The TPM
    then checks if the current PCR 11 state matches the supplied
    signature from the .pcrsig section and the public key used during
    enrollment. If all checks out it decrypts (“unseals”) the DEK and
    passes it back to the OS, where it is then passed to the kernel
    which implements the symmetric part of disk encryption.

Note that in this scheme the encrypted volume’s DEK is not bound
to specific literal PCR hash values, but to a public key which is
expected to sign PCR hash values.

Also note that the state of PCR 11 only matters during unlocking. It
is not used or checked when enrolling.

In this scenario:

  • Input to the TPM part of the enrollment process are the TPM’s
    internal SRK, the plaintext DEK provided by the OS, and the public
    key later used for signing expected PCR values, also provided by the
    OS. – Output is the encrypted (“sealed”) DEK.

  • Input to the TPM part of the unlocking process are the TPM’s
    internal SRK, the current TPM PCR 11 values, the public key used
    during enrollment, a signature that matches both these PCR values
    and the public key, and the encrypted DEK. – Output is the plaintext
    (“unsealed”) DEK.

Note that sealing/unsealing is done entirely on the TPM chip, the host
OS just provides the inputs (well, only the inputs that the TPM chip
doesn’t know already on its own), and receives the outputs. With the
exception of the plaintext DEK, none of the inputs/outputs are
sensitive, and can safely be stored in the open. On the wire the
plaintext DEK is protected via TPM parameter encryption (not discussed
in detail here because, though important, it is not in scope for this
document).

TPM PCR 11 is the most important of the mentioned PCRs, and its use is
thus explained in detail here. The other mentioned PCRs can be used in
similar ways, but signatures/public keys must be provided via other
means.

This scheme builds on the functionality LUKS2 provides on Linux,
i.e. key management supporting multiple slots, and the
ability to embed arbitrary metadata in the encrypted volume’s
superblock. Note that this means the TPM2-based logic explained here
doesn’t have to be the only way to unlock an encrypted volume. For
example, in many setups it is wise to enroll both this TPM-based
mechanism and an additional “recovery key” (i.e. a high-entropy
computer generated passphrase the user can provide manually in case
they lose access to the TPM and need to access their data), of which
either can be used to unlock the volume.
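
In the systemd implementation the enrollment and unlocking steps
described above map to systemd-cryptenroll and systemd-cryptsetup. A
rough sketch, also enrolling a recovery key as suggested (the device
path and the file names under /.extra/ are illustrative):

$ systemd-cryptenroll --recovery-key /dev/sda2
$ systemd-cryptenroll --tpm2-device=auto \
      --tpm2-public-key=/.extra/tpm2-pcr-public-key.pem /dev/sda2

Unlocking can then be requested via /etc/crypttab, passing the
signature file the stub placed in /.extra/:

root /dev/sda2 - tpm2-device=auto,tpm2-signature=/.extra/tpm2-pcr-signature.json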

Boot Phases

Secrets needed during boot-up (such as the root file system encryption
key) should typically not be accessible anymore afterwards, to protect
them from access if a system is attacked during runtime. To implement
this the scheme above is extended in one way: at certain milestones of
the boot process additional fixed “words” should be measured into PCR
11. These milestones are placed at conceptual security boundaries,
i.e. whenever code transitions from a higher privileged context to a
less privileged context.

Specifically:

  • When the initrd initializes (“initrd-enter”)

  • When the initrd transitions into the root file system (“initrd-leave”)

  • When the early boot phase of the OS on the root file system has
    completed, i.e. all storage and file systems have been set up and
    mounted, immediately before regular services are started
    (“sysinit”)

  • When the OS on the root file system completed the boot process far
    enough to allow unprivileged users to log in (“complete”)

  • When the OS begins shut down (“shutdown”)

  • When the service manager is mostly finished with shutting down and
    is about to pass control to the final phase of the shutdown logic
    (“final”)

By measuring these additional words into PCR 11 the distinct phases of
the boot process can be distinguished in a relatively straight-forward
fashion and the expected PCR values in each phase can be determined.
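
In the systemd implementation these words are measured by the
systemd-pcrphase tool, invoked from small service units at the
respective milestones. For illustration, measuring one such word by
hand would look roughly like this (using one of the phase strings from
the list above):

$ /usr/lib/systemd/systemd-pcrphase sysinit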

The phases are measured into PCR 11 (as opposed to some other PCR)
mostly because available PCRs are scarce, and the boot phases defined
are typically specific to a chosen OS, and hence fit well with the
other data measured into PCR 11: the UKI which is also specific to the
OS. The OS vendor generates both the UKI and defines the boot phases,
and thus can safely and reliably pre-calculate/sign the expected PCR
values for each phase of the boot.

Revocation/Rollback Protection

In order to secure secrets stored at rest, in particular in
environments where unattended decryption shall be possible, it is
essential that an attacker cannot use old, known-buggy – but properly
signed – versions of software to access them.

Specifically, if disk encryption is bound to an OS vendor (via UKIs
that include expected PCR values, signed by the vendor’s public key)
there must be a mechanism to lock out old versions of the OS or UKI
from accessing TPM based secrets once it is determined that the old
version is vulnerable.

To implement this we propose making use of one of the “counters” TPM
2.0 devices provide: integer registers that are persistent in the TPM
and can only be increased on request of the OS, but never be
decreased. When sealing resources to the TPM, a policy may be declared
to the TPM that restricts how the resources can later be unlocked:
here we use one that requires that along with the expected PCR values
(as discussed above) a counter integer range is provided to the TPM
chip, along with a suitable signature covering both, matching the
public key provided during sealing. The sealing/unsealing mechanism
described above is thus extended: the signature passed to the TPM
during unsealing now covers both the expected PCR values and the
expected counter range. To be able to use a signature associated with
an UKI provided by the vendor to unseal a resource, the counter thus
must be at least increased to the lower end of the range the signature
is for. Doing so removes the ability to unseal the resource with
signatures associated with older versions of the UKI, because the
upper end of their counter range no longer covers the counter value
once it has been increased far enough. By carefully choosing the upper
and lower end of
the counter range whenever the PCR values for an UKI shall be signed
it is thus possible to ensure that updates can invalidate prior
versions’ access to resources. By placing some space between the upper
and lower end of the range it is possible to allow a controlled level
of fallback UKI support, with clearly defined milestones where
fallback to older versions of an UKI is not permitted anymore.

Example: a hypothetical distribution FooOS releases a regular stream
of UKI kernels 5.1, 5.2, 5.3, … It signs the expected PCR values for
these kernels with a key pair it maintains in a HSM. When signing UKI
5.1 it includes information directed at the TPM in the signed data
declaring that the TPM counter must be above 100, and below 120, in
order for the signature to be used. Thus, when the UKI is booted up
and used for unlocking an encrypted volume the unlocking code must
first increase the counter to 100 if needed, as the TPM will otherwise
refuse unlocking the volume. The next release of the UKI, i.e. UKI 5.2
is a feature release, i.e. reverting back to the old kernel locally is
acceptable. It thus does not increase the lower bound, but it
increases the upper bound for the counter in the signature payload,
thus encoding a valid range 100…121 in the signed payload. Now a major
security vulnerability is discovered in UKI 5.1. A new UKI 5.3 is
prepared that fixes this issue. It is now essential that UKI 5.1 can
no longer be used to unlock the TPM secrets. Thus UKI 5.3 will bump
the lower bound to 121, and increase the upper bound by one, thus
allowing a range 121…122. Or in other words: for each new UKI release
the signed data shall include a counter range declaration where the
upper bound is increased by one. The lower range is left as-is between
releases, except when an old version shall be cut off, in which case
it is bumped to one above the upper bound used in that release.

UKI Generation

As mentioned earlier, UKIs are the combination of various resources
into one PE file. For most of these individual components there are
pre-existing tools to generate the components. For example the
included kernel image can be generated with the usual Linux kernel
build system. The initrd included in the UKI can be generated with
existing tools such as dracut and similar. Once the basic components
(.linux, .initrd, .cmdline, .splash, .dtb, .osrel,
.uname) have been acquired the combination process works roughly
like this:

  1. The expected PCR 11 hashes (and signatures for them) for the UKI
    are calculated. The tool for that takes all basic UKI components
    and a signing key as input, and generates a JSON object as output
    that includes both the literal expected PCR hash values and a
    signature for them. (For all selected TPM2 banks)

  2. The EFI stub binary is now combined with the basic components, the
    generated JSON PCR signature object from the first step (in the
    .pcrsig section) and the public key for it (in the .pcrpkey
    section). This is done via a simple “objcopy” invocation
    resulting in a single UKI PE binary.

  3. The resulting EFI PE binary is then signed for SecureBoot (via a
    tool such as
    sbsign
    or similar).
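
Concretely, with the tools discussed later in this document, steps 1
and 2 might look roughly like the following sketch. File names, the
signing key and the chosen PCR bank are illustrative, the PE section
offsets merely need to not overlap, and the stub path shown is where
systemd typically installs sd-stub on x86-64:

$ systemd-measure sign \
      --linux=vmlinuz --initrd=initrd.cpio \
      --cmdline=cmdline.txt --osrel=os-release \
      --pcrpkey=pcr-key.pub.pem --bank=sha256 \
      --private-key=pcr-key.priv.pem --public-key=pcr-key.pub.pem \
      > pcrsig.json

$ objcopy \
      --add-section .osrel=os-release --change-section-vma .osrel=0x20000 \
      --add-section .cmdline=cmdline.txt --change-section-vma .cmdline=0x30000 \
      --add-section .pcrpkey=pcr-key.pub.pem --change-section-vma .pcrpkey=0x40000 \
      --add-section .pcrsig=pcrsig.json --change-section-vma .pcrsig=0x50000 \
      --add-section .linux=vmlinuz --change-section-vma .linux=0x2000000 \
      --add-section .initrd=initrd.cpio --change-section-vma .initrd=0x3000000 \
      /usr/lib/systemd/boot/efi/linuxx64.efi.stub foo.efi

The resulting foo.efi is what then gets signed for SecureBoot in step
3, e.g. with sbsign as shown earlier.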

Note that the UKI model implies pre-built initrds. How to generate
these (and securely extend and parameterize them) is outside of the
scope of this document, but a related document will be provided
highlighting these concepts.

Protection Coverage of SecureBoot Signing and PCRs

The scheme discussed here touches both SecureBoot code signing and TPM
PCR measurements. These two distinct mechanisms cover separate parts
of the boot process.

Specifically:

  • Firmware/Shim SecureBoot signing covers bootloader and UKI

  • TPM PCR 11 covers the UKI components and boot phase

  • TPM PCR 12 covers admin configuration

  • TPM PCR 15 covers the local identity of the host

Note that this means SecureBoot coverage ends once the system
transitions from the initrd into the root file system. It is assumed
that trust and integrity have been established before this transition
by some means, for example LUKS/dm-crypt/dm-integrity, ideally bound
to PCR 11 (i.e. UKI and boot phase).

A robust and secure update scheme for PCR 11 (i.e. UKI) has been
described above, which allows binding TPM-locked resources to a
UKI. For PCR 12 no such scheme is currently designed, but might be
added later (use case: permit access to certain secrets only if the
system runs with configuration signed by a specific set of
keys). Given that resources measured into PCR 15 typically aren’t
updated (or if they are updated loss of access to other resources
linked to them is desired) no update scheme should be necessary for
it.

This document focuses on the three PCRs discussed above. Disk
encryption and other userspace components may choose to also bind to
other PCRs. However, doing so brings back the PCR brittleness issue
that this design is supposed to remove. PCRs defined by the various
firmware UEFI/TPM specifications generally do not know any concept for
signatures of expected PCR values.

It is known that the industry-adopted SecureBoot signing keys are too
broad to act as more than a denylist for known bad code. It is thus
probably a good idea to enroll vendor SecureBoot keys wherever
possible (e.g. in environments where the hardware is very well known,
and VM environments), to raise the bar on preparing rogue UKI-like PE
binaries that will result in PCR values that match expectations but
actually contain bad code. Discussion about that is however outside of
the scope of this document.

Whole OS embedded in the UKI

The above is written under the assumption that the UKI embeds an
initrd whose job it is to set up the root file system: find it,
validate it, cryptographically unlock it and similar. Once the root
file system is found, the system transitions into it.

While this is the traditional design and likely what most systems will
use, it is also possible to embed a regular root file system into the
UKI and avoid any transition to an on-disk root file system. In this
mode the whole OS would be encapsulated in the UKI, and
signed/measured as one. In such a scenario the whole of the OS must be
loaded into RAM and remain there, which typically restricts the
general usability of such an approach. However, for specific purposes
this might be the design of choice, for example to implement
self-sufficient recovery or provisioning systems.

Proposed Implementations & Current Status

The toolset for most of the above is already implemented in systemd and related projects in one way or another. Specifically:

  1. The
    systemd-stub
    (or short: sd-stub) component implements the discussed UEFI stub
    program

  2. The
    systemd-measure
    tool can be used to pre-calculate expected PCR 11 values given the
    UKI components and can sign the result, as discussed in the UKI
    Image Generation section above.

  3. The
    systemd-cryptenroll
    and
    systemd-cryptsetup
    tools can be used to bind a LUKS2 encrypted file system volume to a
    TPM and PCR 11 public key/signatures, according to the scheme
    described above. (The two components also implement a “recovery
    key
    ” concept, as discussed above)

  4. The
    systemd-pcrphase
    component measures specific words into PCR 11 at the discussed
    phases of the boot process.

  5. The
    systemd-creds
    tool may be used to encrypt/decrypt data objects called
    “credentials” that can be passed into services and booted systems,
    and are automatically decrypted (if needed) immediately before
    service invocation. Encryption is typically bound to the local TPM,
    to ensure the data cannot be recovered elsewhere.
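
For illustration, encrypting a small secret as a credential bound to
the local TPM2 might look like this (the credential name and the file
names are made up):

$ systemd-creds encrypt --with-key=tpm2 --name=mysecret plaintext.txt mysecret.cred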

Note that
systemd-stub
(i.e. the UEFI code glued into the UKI) is distinct from
systemd-boot
(i.e. the UEFI boot loader that can manage multiple UKIs and other
boot menu items and implements automatic fallback, an interactive menu
and a programmatic interface for the OS among other things). One can
be used without the other – both sd-stub without sd-boot and vice
versa – though they integrate nicely if used in combination.

Note that the mechanisms described are relatively generic, and can be
implemented and consumed in other software too; systemd should be
considered a reference implementation, though one that has found
comprehensive adoption across Linux distributions.

Some concepts discussed above are currently not
implemented. Specifically:

  1. The rollback protection logic is currently not implemented.

  2. The mentioned measurement of the root file system volume key to PCR
    15 is implemented, but not merged into the systemd main branch yet.

Glossary

TPM

Trusted Platform Module; a security chip found in many modern
systems, both physical systems and increasingly also in virtualized
environments. Traditionally a discrete chip on the mainboard but today
often implemented in firmware, and lately directly in the CPU SoC.

PCR

Platform Configuration Register; a set of registers on a TPM that
are initialized to zero at boot. The firmware and OS can “extend”
these registers with hashes of data used during the boot process and
afterwards. “Extension” means the supplied data is first
cryptographically hashed. The resulting hash value is then combined
with the previous value of the PCR and the combination hashed
again. The result will become the new value of the PCR. By doing this
iteratively for all parts of the boot process (always with the data
that will be used next during the boot process) a concept of
“Measured Boot” can be implemented: as long as every element in the
boot chain measures (i.e. extends into the PCR) the next part of the
boot like this, the resulting PCR values will prove cryptographically
that only a certain set of boot components can have been used to boot
up. A standards compliant TPM usually has 24 PCRs, but more than half
of those are already assigned specific meanings by the firmware. Some
of the others may be used by the OS, of which we use four in the
concepts discussed in this document.

Measurement

The act of “extending” a PCR with some data object.

SRK

Storage Root Key; a special cryptographic key generated by a TPM
that never leaves the TPM, and can be used to encrypt/decrypt data
passed to the TPM.

UKI

Unified Kernel Image; the concept this document is about. A
combination of kernel, initrd and other resources. See above.

SecureBoot

A mechanism where every software component involved in the boot
process is cryptographically signed and checked against a set of
public keys stored in the mainboard hardware, implemented in firmware,
before it is used.

Measured Boot

A boot process where each component measures (i.e., hashes and extends
into a TPM PCR, see above) the next component it will pass control to
before doing so. This serves two purposes: it can be used to bind
security policy for encrypted secrets to the resulting PCR values (or
signatures thereof, see above), and it can be used to reason about
used software after the fact, for example for the purpose of remote
attestation.

initrd

Short for “initial RAM disk”, which – strictly speaking – is a
misnomer today, because no RAM disk is involved anymore, but a tmpfs
file system instance. Also known as “initramfs”, which is also
misleading, given the file system is not ramfs anymore, but tmpfs
(both of which are in-memory file systems on Linux, with different
semantics). The initrd is passed to the Linux kernel and is
basically a file system tree in a cpio archive. The kernel unpacks the
image into a tmpfs (i.e., into an in-memory file system), and then
executes a binary from it. It thus contains the binaries for the first
userspace code the kernel invokes. Typically, the initrd’s job is to
find the actual root file system, unlock it (if encrypted), and
transition into it.

UEFI

Short for “Unified Extensible Firmware Interface”, it is a widely
adopted standard for PC firmware, with native support for SecureBoot
and Measured Boot.

EFI

More or less synonymous to UEFI, IRL.

Shim

A boot component originating in the Linux world, which in a way
extends the public key database SecureBoot maintains (which is under
control from Microsoft) with a second layer (which is under control of
the Linux distributions and of the owner of the physical device).

PE

Portable Executable; a file format for executable binaries,
originally from the Windows world, but also used by UEFI firmware. PE
files may contain code and data, categorized in labeled “sections”.

ESP

EFI System Partition; a special partition on a storage medium in
which the firmware looks for UEFI PE binaries to execute at boot.

HSM

Hardware Security Module; a piece of hardware that can generate and
store secret cryptographic keys, and execute operations with them,
without the keys leaving the hardware (though this is
configurable). TPMs can act as HSMs.

DEK

Disk Encryption Key; a symmetric cryptographic key used for
unlocking disk encryption, i.e. passed to LUKS/dm-crypt for activating
an encrypted storage volume.

LUKS2

Linux Unified Key Setup Version 2; a specification for a superblock
for encrypted volumes widely used on Linux. LUKS2 is the default
on-disk format for the cryptsetup suite of tools. It provides
flexible key management with multiple independent key slots and allows
embedding arbitrary metadata in a JSON format in the superblock.

Thanks

I’d like to thank Alain Gefflaut, Anna Trikalinou, Christian Brauner,
Daan de Meyer, Luca Boccassi, Zbigniew Jędrzejewski-Szmek for
reviewing this text.

Fitting Everything Together

Post Syndicated from original https://0pointer.net/blog/fitting-everything-together.html

TLDR: Hermetic /usr/ is awesome; let’s popularize image-based OSes
with modernized security properties built around immutability,
SecureBoot, TPM2, adaptability, auto-updating, factory reset,
uniformity – built from traditional distribution packages, but
deployed via images.

Over the past years, systemd gained a number of components for
building Linux-based operating systems. While these components
individually have been adopted by many distributions and products for
specific purposes, we did not publicly communicate a broader vision
of how they should all fit together in the long run. In this blog story I
hope to provide that from my personal perspective, i.e. explain how I
personally would build an OS and where I personally think OS
development with Linux should go.

I figure this is going to be a longer blog story, but I hope it
will be equally enlightening. Please understand though that everything
I write about OS design here is my personal opinion, and not one of my
employer.

For the last 12 years or so I have been working on Linux OS
development, mostly around systemd. In all those years I had a lot
of time thinking about the Linux platform, and specifically
traditional Linux distributions and their strengths and weaknesses. I
have seen many attempts to reinvent Linux distributions in one way or
another, to varying success. After all this most would probably
agree that the traditional RPM or dpkg/apt-based distributions still
define the Linux platform more than others (for 25+ years now), even
though some Linux-based OSes (Android, ChromeOS) probably outnumber
the installations overall.

And over all those 12 years I kept wondering, how would I actually
build an OS for a system or for an appliance, and what are the
components necessary to achieve that. And most importantly, how can we
make these components generic enough so that they are useful in
generic/traditional distributions too, and in other use cases than my
own.

The Project

Before figuring out how I would build an OS it’s probably good to
figure out what type of OS I actually want to build, what purpose I
intend to cover. I think a desktop OS is probably the most
interesting. Why is that? Well, first of all, I use one of these for my
job every single day, so I care immediately, it’s my primary tool of
work. But more importantly: I think building a desktop OS is one of
the most complex overall OS projects you can work on, simply because
desktops are so much more versatile and variable than servers or
embedded devices. If one figures out the desktop case, I think there’s
a lot more to learn from, and reuse in the server or embedded case,
than going the other way. After all, there’s a reason why so much of the
widely accepted Linux userspace stack comes from people with a desktop
background (including systemd, BTW).

So, let’s see how I would build a desktop OS. If you press me hard,
and ask me why I would do that given that ChromeOS already exists and
more or less is a Linux desktop OS: there’s plenty I am missing in
ChromeOS, but most importantly, I am a lot more interested in building
something people can easily and naturally rebuild and hack on,
i.e. Google-style over-the-wall open source with its skewed power
dynamic is not particularly attractive to me. I much prefer building
this within the framework of a proper open source community, out in
the open, and basing all this strongly on the status quo ante,
i.e. the existing distributions. I think it is crucial to provide a
clear avenue to build a modern OS based on the existing distribution
model, if there shall ever be a chance to make this interesting for a
larger audience.

(Let me underline though: even though I am going to focus on a desktop
here, most of this is directly relevant for servers as well, in
particular container host OSes and suchlike, or embedded devices,
e.g. car IVI systems and so on.)

Design Goals

  1. First and foremost, I think the focus must be on an image-based
    design rather than a package-based one. For robustness and security
    it is essential to operate with reproducible, immutable images that
    describe the OS or large parts of it in full, rather than operating
    always with fine-grained RPM/dpkg style packages. That’s not to say
    that packages are not relevant (I actually think they matter a
    lot!), but I think they should be less a tool for deploying code
    and more one for building the objects to deploy. A different way to
    see this: any OS built like this must be easy to replicate in a
    large number of instances, with minimal variability. Regardless if
    we talk about desktops, servers or embedded devices: focus for my
    OS should be on “cattle”, not “pets”, i.e. that from the start it’s
    trivial to reuse the well-tested, cryptographically signed
    combination of software over a large set of devices the same way,
    with a maximum of bit-exact reuse and a minimum of local variances.

  2. The trust chain matters, from the boot loader all the way to the
    apps. This means all code that is run must be cryptographically
    validated before it is run. All storage must be cryptographically
    protected: public data must be integrity checked; private data must
    remain confidential.

    This is in fact where big distributions currently fail pretty
    badly. I would go as far as saying that SecureBoot on Linux
    distributions is mostly security theater at this point, if you so
    will. That’s because the initrd that unlocks your FDE (i.e. the
    cryptographic concept that protects the rest of your system) is not
    signed or protected in any way. It’s trivial to modify for an
    attacker with access to your hard disk in an undetectable way, and
    collect your FDE passphrase. The involved bureaucracy around the
    implementation of UEFI SecureBoot of the big distributions is to a
    large degree pointless if you ask me, given that once the kernel is
    assumed to be in a good state, as the next step the system invokes
    completely unsafe code with full privileges.

    This is a fault of current Linux distributions though, not of
    SecureBoot in general. Other OSes use this functionality in more
    useful ways, and we should correct that too.

  3. Pretty much the same thing: offline security matters. I want
    my data to be reasonably safe at rest, i.e. cryptographically
    inaccessible even when I leave my laptop in my hotel room,
    suspended.

  4. Everything should be cryptographically measured, so that remote
    attestation is supported for as much software shipped on the OS as
    possible.

  5. Everything should be self descriptive, have single sources of truths
    that are closely attached to the object itself, instead of stored
    externally.

  6. Everything should be self-updating. Today we know that software is
    never bug-free, and thus requires a continuous update cycle. Not
    only the OS itself, but also any extensions, services and apps
    running on it.

  7. Everything should be robust in respect to aborted OS operations,
    power loss and so on. It should be robust towards hosed OS updates
    (regardless if the download process failed, or the image was
    buggy), and not require user interaction to recover from them.

  8. There must always be a way to put the system back into a
    well-defined, guaranteed safe state (“factory reset”). This
    includes that all sensitive data from earlier uses becomes
    cryptographically inaccessible.

  9. The OS should enforce clear separation between vendor resources,
    system resources and user resources: conceptually and when it comes
    to cryptographical protection.

  10. Things should be adaptive: the system should come up and make the
    best of the system it runs on, adapt to the storage and
    hardware. Moreover, the system should support execution on bare
    metal equally well as execution in a VM environment and in a
    container environment (i.e. systemd-nspawn).

  11. Things should not require explicit installation, i.e. every image
    should be a live image. For installation it should be sufficient to
    dd an OS image onto disk. Thus, strong focus on “instantiate on
    first boot”, rather than “instantiate before first boot”.

  12. Things should be reasonably minimal. The image the system starts
    its life with should be quick to download, and not include
    resources that can as well be created locally later.

  13. System identity, local cryptographic keys and so on should be
    generated locally, not be pre-provisioned, so that there’s no leak
    of sensitive data during the transport onto the system possible.

  14. Things should be reasonably democratic and hackable. It should be
    easy to fork an OS, to modify an OS and still get reasonable
    cryptographic protection. Modifying your OS should not necessarily
    imply that your “warranty is voided” and you lose all good
    properties of the OS, if you so will.

  15. Things should be reasonably modular. The privileged part of the
    core OS must be extensible, including on the individual system.
    It’s not sufficient to support extensibility just through
    high-level UI applications.

  16. Things should be reasonably uniform, i.e. ideally the same formats
    and cryptographic properties are used for all components of the
    system, regardless if for the host OS itself or the payloads it
    receives and runs.

  17. Even taking all these goals into consideration, it should still be
    close to traditional Linux distributions, and take advantage of what
    they are really good at: integration and security update cycles.

Now that we know our goals and requirements, let’s start designing the
OS along these lines.

Hermetic /usr/

First of all the OS resources (code, data files, …) should be
hermetic in an immutable /usr/. This means that a /usr/ tree
should carry everything needed to set up the minimal set of
directories and files outside of /usr/ to make the system work. This
/usr/ tree can then be mounted read-only into the writable root file
system that then will eventually carry the local configuration, state
and user data in /etc/, /var/ and /home/ as usual.

Thankfully, modern distributions are surprisingly close to working
without issues in such a hermetic context. Specifically, Fedora works
mostly just fine: it has adopted the /usr/ merge and the declarative
systemd-sysusers
and
systemd-tmpfiles
components quite comprehensively, which means the directory trees
outside of /usr/ are automatically generated as needed if missing.
In particular /etc/passwd and /etc/group (and related files) are
appropriately populated, should they be missing entries.
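
For illustration, hypothetical declarative snippets like the following
are all that is needed for such users, directories and files to be
recreated automatically at boot if they are missing:

# /usr/lib/sysusers.d/myservice.conf
u myservice - "My Service User"

# /usr/lib/tmpfiles.d/myservice.conf
d /var/lib/myservice 0750 myservice myservice -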

In my model a hermetic OS is hence comprehensively defined within
/usr/: combine the /usr/ tree with an empty, otherwise unpopulated
root file system, and it will boot up successfully, automatically
adding the strictly necessary files, and resources that are necessary
to boot up.

Monopolizing vendor OS resources and definitions in an immutable
/usr/ opens multiple doors to us:

  • We can apply dm-verity to the whole /usr/ tree, i.e. guarantee
    structural, cryptographic integrity on the whole vendor OS resources
    at once, with full file system metadata.

  • We can implement updates to the OS easily: by implementing an A/B
    update scheme on the /usr/ tree we can update the OS resources
    atomically and robustly, while leaving the rest of the OS environment
    untouched.

  • We can implement factory reset easily: erase the root file system
    and reboot. The hermetic OS in /usr/ has all the information it
    needs to set up the root file system afresh — exactly like in a new
    installation.

Initial Look at the Partition Table

So let’s have a look at a suitable partition table, taking a hermetic
/usr/ into account. Let’s conceptually start with a table of four
entries:

  1. An UEFI System Partition (required by firmware to boot)

  2. Immutable, Verity-protected, signed file system with the /usr/ tree in version A

  3. Immutable, Verity-protected, signed file system with the /usr/ tree in version B

  4. A writable, encrypted root file system

(This is just for initial illustration here, as we’ll see later it’s
going to be a bit more complex in the end.)

The Discoverable Partitions
Specification
provides
suitable partition types UUIDs for all of the above partitions. Which
is great, because it makes the image self-descriptive: simply by
looking at the image’s GPT table we know what to mount where. This
means we do not need a manual /etc/fstab, and a multitude of tools
such as systemd-nspawn and similar can operate directly on the disk
image and boot it up.
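
For illustration, a tool such as systemd-dissect can show how it would
dissect and mount such an image, purely based on the GPT partition
types (the image name is an example):

$ systemd-dissect image.raw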

Booting

Now that we have a rough idea how to organize the partition table,
let’s look a bit at how to boot into that. Specifically, in my model
“unified kernels” are the way to go, specifically those implementing
Boot Loader Specification Type #2. These are basically
kernel images that have an initial RAM disk attached to them, as well as
a kernel command line, a boot splash image and possibly more, all
wrapped into a single UEFI PE binary. By combining these into one we
achieve two goals: they become extremely easy to update (i.e. drop in
one file, and you update kernel+initrd) and more importantly, you can
sign them as one for the purpose of UEFI SecureBoot.

In my model, each version of such a kernel would be associated with
exactly one version of the /usr/ tree: both are always updated at
the same time. An update then becomes relatively simple: drop in one
new /usr/ file system plus one kernel, and the update is complete.

The boot loader used for all this would be
systemd-boot,
of course. It’s a very simple loader, and implements the
aforementioned boot loader specification. This means it requires no
explicit configuration or anything: it’s entirely sufficient to drop
in one such unified kernel file, and it will be picked up, and be made
a candidate to boot into.

You might wonder how to configure the root file system to boot from
with such a unified kernel that contains the kernel command line and
is signed as a whole and thus immutable. The idea here is to use the
usrhash= kernel command line option implemented by
systemd-veritysetup-generator
and
systemd-fstab-generator. It
does two things: it will search and set up a dm-verity volume for
the /usr/ file system, and then mount it. It takes the root hash
value of the dm-verity Merkle tree as the parameter. This hash is
then also used to find the /usr/ partition in the GPT partition
table, under the assumption that the partition UUIDs are derived from
it, as per the suggestions in the discoverable partitions
specification (see above).
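
A kernel command line embedded in such a unified kernel might hence
contain little more than something like the following, where the hash
value is purely illustrative and would in reality be the dm-verity
root hash of the matching /usr/ file system:

usrhash=7ae1cbc3bd0b33b3e9bd02a85d02bcd07a3d56b4a2e3dbe44b7d37e7e47b5d92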

systemd-boot (if not told otherwise) will do a version sort of the
kernel image files it finds, and then automatically boot the newest
one. Picking a specific kernel to boot will also fixate which version
of the /usr/ tree to boot into, because — as mentioned — the Verity
root hash of it is built into the kernel command line the unified
kernel image contains.

In my model I’d place the kernels directly into the UEFI System
Partition (ESP), in order to simplify things. (systemd-boot also
supports reading them from a separate boot partition, but let’s not
complicate things needlessly, at least for now.)

So, with all this, we now already have a boot chain that goes
something like this: once the boot loader is run, it will pick the
newest kernel, which includes the initial RAM disk and a secure
reference to the /usr/ file system to use. This is already
great. But a /usr/ alone won’t make us happy, we also need a root
file system. In my model, that file system would be writable, and the
/etc/ and /var/ hierarchies would be located directly on it. Since
these trees potentially contain secrets (SSH keys, …) the root file
system needs to be encrypted. We’ll use LUKS2 for this, of course. In
my model, I’d bind this to the TPM2 chip (for compatibility with
systems lacking one, we can find a suitable fallback, which then
provides weaker guarantees, see below). A TPM2 is a security chip
available in most modern PCs. Among other things it contains a
persistent secret key that can be used to encrypt data, in a way that
only if you possess access to it and can prove you are using validated
software you can decrypt it again. The cryptographic measuring I
mentioned earlier is what allows this to work. But … let’s not get
lost too much in the details of TPM2 devices, that’d be material for a
novel, and this blog story is going to be way too long already.

What does using a TPM2 bound key for unlocking the root file system
get us? We can encrypt the root file system with it, and you can only
read or make changes to the root file system if you also possess the
TPM2 chip and run our validated version of the OS. This protects us
against an evil maid scenario to some level: an attacker cannot
just copy the hard disk of your laptop while you leave it in your
hotel room, because unless the attacker also steals the TPM2 device it
cannot be decrypted. The attacker can also not just modify the root
file system, because such changes would be detected on next boot
because they aren’t done with the right cryptographic key.

So, now we have a system that already can boot up somewhat completely,
and run userspace services. All code that is run is verified in some
way: the /usr/ file system is Verity protected, and the root hash of
it is included in the kernel that is signed via UEFI SecureBoot. And
the root file system is locked to the TPM2 where the secret key is
only accessible if our signed OS + /usr/ tree is used.

(One brief intermission here: so far all the components I am
referencing here exist already, and have been shipped in systemd and
other projects already, including the TPM2 based disk
encryption. There’s one thing missing here however at the moment that
still needs to be developed (happy to take PRs!): right now TPM2 based
LUKS2 unlocking is bound to PCR hash values. This is hard to work with
when implementing updates — what we’d need instead is unlocking by
signatures of PCR hashes. TPM2 supports this, but we don’t support it
yet in our systemd-cryptsetup + systemd-cryptenroll stack.)

One of the goals mentioned above is that cryptographic key material
should always be generated locally on first boot, rather than
pre-provisioned. This of course has implications for the encryption
key of the root file system: if we want to boot into this system we
need the root file system to exist, and thus a key already generated
that it is encrypted with. But where precisely would we generate it if
we have no installer that could generate it while installing (as is
done in traditional Linux distribution installers)? My proposed
solution here is to use
systemd-repart,
which is a declarative, purely additive repartitioner. It can run from
the initrd to create and format partitions on boot, before
transitioning into the root file system. It can also format the
    partitions it creates and encrypt them, automatically enrolling a
    TPM2-bound key.
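
A hypothetical systemd-repart definition for such a root partition
might look roughly like this (the file name and chosen file system are
illustrative):

# /usr/lib/repart.d/50-root.conf
[Partition]
Type=root
Format=btrfs
Encrypt=tpm2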

So, let’s revisit the partition table we mentioned earlier. Here’s
what in my model we’d actually ship in the initial image:

  1. An UEFI System Partition (ESP)

  2. An immutable, Verity-protected, signed file system with the /usr/ tree in version A

And that’s already it. No root file system, no B /usr/ partition,
nothing else. Only two partitions are shipped: the ESP with the
systemd-boot loader and one unified kernel image, and the A version
of the /usr/ partition. Then, on first boot systemd-repart will
notice that the root file system doesn’t exist yet, and will create
it, encrypt it, and format it, and enroll the key into the TPM2. It
will also create the second /usr/ partition (B) that we’ll need for
later A/B updates (which will be created empty for now, until the
first update operation actually takes place, see below). Once done the
initrd will combine the fresh root file system with the shipped
/usr/ tree, and transition into it. Because the OS is hermetic in
/usr/ and contains all the systemd-tmpfiles and systemd-sysuser
information it can then set up the root file system properly and
create any directories and symlinks (and maybe a few files) necessary
to operate.

Besides the fact that the root file system’s encryption keys are
generated on the system we boot from and never leave it, it is also
pretty nice that the root file system will be sized dynamically,
taking into account the physical size of the backing storage. This is
perfect, because on first boot the image will automatically adapt to what
it has been dd‘ed onto.

Factory Reset

This is a good point to talk about the factory reset logic, i.e. the
mechanism to place the system back into a known good state. This is
important for two reasons: in our laptop use case, once you want to
pass the laptop to someone else, you want to ensure your data is fully
and comprehensively erased. Moreover, if you have reason to believe
your device was hacked you want to revert the device to a known good
state, i.e. ensure that exploits cannot persist. systemd-repart
already has a mechanism for it. In the declarations of the partitions
the system should have, entries may be marked to be candidates for
erasing on factory reset. The actual factory reset is then requested
by one of two means: by specifying a specific kernel command line
option (which is not too interesting here, given we lock that down via
UEFI SecureBoot; but then again, one could also add a second kernel to
the ESP that is identical to the first, with the only difference that it
lists this command line option: thus when the user selects this entry
it will initiate a factory reset) — and via an EFI variable that can
be set and is honoured on the immediately following boot. So here’s
how a factory reset would then go down: once the factory reset is
requested it’s enough to reboot. On the subsequent boot
systemd-repart runs from the initrd, where it will honour the
request and erase the partitions marked for erasing. Once that is
complete the system is back in the state we shipped the system in:
only the ESP and the /usr/ file system will exist, but the root file
system is gone. And from here we can continue as on the original first
boot: create a new root file system (and any other partitions), and
encrypt/set it up afresh.
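
Continuing the hypothetical partition definition from above, marking
the root partition as a candidate for factory reset is a one line
addition, and the reset itself can then be requested via a kernel
command line option, for example from a dedicated boot menu entry
(both shown as a sketch; the systemd-repart documentation is
authoritative here):

# /usr/lib/repart.d/50-root.conf (addition)
FactoryReset=yes

# on the kernel command line of the dedicated “factory reset” entry
systemd.factory_reset=yes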

So now we have a nice setup, where everything is either signed or
encrypted securely. The system can adapt to the system it is booted on
automatically on first boot, and can easily be brought back into a
well defined state identical to the way it was shipped in.

Modularity

But of course, such a monolithic, immutable system is only useful for
very specific purposes. If /usr/ can’t be written to, – at least in
the traditional sense – one cannot just go and install a new software
package that one needs. So here two goals are superficially
conflicting: on one hand one wants modularity, i.e. the ability to
add components to the system, and on the other immutability, i.e. that
precisely this is prohibited.

So let’s see what I propose as a middle ground in my model. First,
what’s the precise use case for such modularity? I see a couple of
different ones:

  1. For some cases it is necessary to extend the system itself at the
    lowest level, so that the added components extend (or maybe even
    replace) the resources shipped in the base OS image, so that they live
    in the same namespace, and are subject to the same security
    restrictions and privileges. Exposure to the details of the base OS
    and its interface is at its maximum for this kind of modularity.

    Example: a module that adds a debugger or tracing tools into the
    system. Or maybe an optional hardware driver module.

  2. In other cases, more isolation is preferable: instead of extending
    the system resources directly, additional services shall be added
    in that bring their own files, can live in their own namespace
    (but with “windows” into the host namespaces), however still are
    system components, and provide services to other programs, whether
    local or remote. Exposure to the details of the base OS for this
    kind of modularity is restricted: it mostly focuses on the
    ability to consume and provide IPC APIs from/to the
    system. Components of this type can still be highly privileged, but
    the level of integration is substantially smaller than for the type
    explained above.

    Example: a module that adds a specific VPN connection service to
    the OS.

  3. Finally, there’s the actual payload of the OS. This stuff is
    relatively isolated from the OS and definitely from each other. It
    mostly consumes OS APIs, and generally doesn’t provide OS
    APIs. This kind of stuff runs with minimal privileges, and in its
    own namespace of concepts.

    Example: a desktop app, for reading your emails.

Of course, the lines between these three types of modules are blurry,
but I think distinguishing them does make sense, as I think different
mechanisms are appropriate for each. So here’s what I’d propose in my
model to use for this.

  1. For the system extension case I think the
    systemd-sysext
    images are appropriate. This tool operates on
    system extension images that are very similar to the host’s disk
    image: they also contain a /usr/ partition, protected by
    Verity. However, they just include additions to the host image:
    binaries that extend the host. When such a system extension image
    is activated, it is merged via an immutable overlayfs mount into
    the host’s /usr/ tree. Thus any file shipped in such a system
    extension will suddenly appear as if it was part of the host OS
    itself. For optional components that should more or less be
    considered part of the OS, this is a very simple and powerful way to
    combine an immutable OS with immutable extensions. Note that
    extensions for an OS built around this tool should most likely be
    produced at the same time, within the same update cycle scheme, as the host OS
    itself. After all, the files included in the extensions will have
    dependencies on files in the system OS image, and care must be
    taken that these dependencies remain in order.

  2. For adding in additional somewhat isolated system services in my
    model, Portable Services
    are the proposed tool of choice. Portable services are in most ways
    just like regular system services; they could be included in the
    system OS image or an extension image. However, portable services
    use
    RootImage=
    to run off separate disk images, thus within their own
    namespace. Images set up this way have various ways to integrate
    into the host OS, as they are in most ways regular system services,
    which just happen to bring their own directory tree. Also, unlike
    regular system services, for them sandboxing is opt-out rather than
    opt-in. In my model, here too the disk images are Verity protected
    and thus immutable. Just like the host OS they are GPT disk images
    that come with a /usr/ partition and Verity data, along with
    signing.

  3. Finally, the actual payload of the OS, i.e. the apps. To be useful
    in real life here it is important to hook into existing ecosystems,
    so that a large set of apps is available. Given that on Linux
    flatpak (or, on servers, OCI containers) is the established format
    that pretty much won, it is probably the way to go. That said, I
    think both of these mechanisms have relatively weak properties, in
    particular when it comes to security, since
    immutability/measurements and similar are not provided. This means
    that, unlike for system extensions and portable services, a complete trust
    chain with attestation and per-app cryptographically protected data
    is much harder to implement sanely.

What I’d like to underline here is that the main system OS image, as
well as the system extension images and the portable service images
are put together the same way: they are GPT disk images, with one
immutable file system and associated Verity data. The latter two
should also contain a PKCS#7 signature for the top-level Verity
hash. This uniformity has many benefits: you can use the same tools to
build and process these images, but most importantly: by using a
single way to validate them throughout the stack (i.e. Verity, in the
latter cases with PKCS#7 signatures), validation and measurement is
straightforward. In fact it’s so obvious that we don’t even have to
implement it in systemd: the kernel has direct support for this Verity
signature checking natively already (IMA).

So, by composing a system at runtime from a host image, extension
images and portable service images we have a nicely modular system
where every single component is cryptographically validated on every
single IO operation, and every component is measured, in its entire
combination, directly in the kernel’s IMA subsystem.

(Of course, once you add the desktop apps or OCI containers on top,
then these properties are lost further down the chain. But well, a lot
is already won, if you can close the chain that far down.)

Note that system extensions are not designed to replicate the fine
grained packaging logic of RPM/dpkg. Of course, systemd-sysext is a
generic tool, so you can use it for whatever you want, but there’s a
reason it does not bring support for a dependency language: the goal
here is not to replicate traditional Linux packaging (we have that
already, in RPM/dpkg, and I think they are actually OK for what they
do) but to provide delivery of larger, coarser sets of functionality,
in lockstep with the underlying OS’ life-cycle and in particular with
no interdependencies, except on the underlying OS.

Also note that depending on the use case it might make sense to also
use system extensions to modularize the initrd step. This is
probably less relevant for a desktop OS, but for server systems it
might make sense to package up support for specific complex storage in
a systemd-sysext system extension, which can be applied to the
initrd that is built into the unified kernel. (In fact, we have been
working on implementing signed yet modular initrd support for general
purpose Fedora this way.)

Note that portable services are composable from system extensions too,
by the way. This makes them even more useful, as you can share a
common runtime between multiple portable services, or even use the host
image as a common runtime for portable services. In this model a common
runtime image is shared between one or more system extensions, and
composed at runtime via an overlayfs instance.
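To make this a bit more tangible, here’s a hedged sketch of how
attaching such portable service images could look with portablectl.
The image names are made up; the --extension= switch is what stacks a
service-only extension image on top of a shared runtime image:

# attach a self-contained portable service image (name is illustrative)
portablectl attach --enable --now /var/lib/portables/acmevpn_1.0.raw

# or: stack a service-only extension on top of a shared runtime image
portablectl attach --enable --now --extension=acmevpn-service_1.0.raw acmevpn-runtime_2.2.raw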

More Modularity: Secondary OS Installs

Having an immutable, cryptographically locked down host OS is great I
think, and if we have some moderate modularity on top, that’s also
great. But oftentimes it’s useful to be able to depart from or
compromise on that for specific use cases, i.e. to provide a bridge,
for example to allow workloads designed around RPM/dpkg package
management to coexist reasonably nicely with such an immutable host.

For this purpose in my model I’d propose using systemd-nspawn
containers. The containers are focused on OS containerization,
i.e. they allow you to run a full OS with init system and everything
as payload (unlike for example Docker containers which focus on a
single service, and where running a full OS in it is a mess).

Running systemd-nspawn containers for such secondary OS installs has
various nice properties. One of course is that systemd-nspawn
supports the same level of cryptographic image validation that we rely
on for the host itself. Thus, to some level the whole OS trust chain
is reasonably recursive if desired: the firmware validates the OS, and the OS can
validate a secondary OS installed within it. In fact, we can run our
trusted OS recursively on itself and get similar security guarantees!
Besides these security aspects, systemd-nspawn also has really nice
properties when it comes to integration with the host. For example,
the --bind-user= switch permits binding a host user record and their
home directory into a container in a simple one-step operation. This
makes it extremely easy to have a single user and $HOME but share it
concurrently with the host and a zoo of secondary OSes in
systemd-nspawn containers, each of which could even run a different
distribution.
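As a minimal sketch (the image path and user name are assumptions),
booting such a secondary OS image could look like this; the image’s
Verity data and signature are validated roughly the same way as for
the host, per the Discoverable Partitions Specification:

# boot a secondary OS from a signed GPT disk image, binding in a host user
systemd-nspawn --image=/var/lib/machines/devel.raw --bind-user=lennart -b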

Developer Mode

Superficially, an OS with an immutable /usr/ appears much less
hackable than an OS where everything is writable. Moreover, an OS
where everything must be signed and cryptographically validated makes
it hard to insert your own code, given you are unlikely to possess
access to the signing keys.

To address this issue other systems have supported a “developer” mode:
when entered the security guarantees are disabled, and the system can
be freely modified, without cryptographic validation. While that’s a
great concept to have I doubt it’s what most developers really want:
the cryptographic properties of the OS are great after all, and it
sucks having to give them up once developer mode is activated.

In my model I’d thus propose two different approaches to this
problem. First of all, I think there’s value in allowing users to
additively extend/override the OS via local developer system
extensions
. With
this scheme the underlying cryptographic validation would remain
intact, but, if this form of development mode is explicitly enabled,
the developer could add in more resources from local storage that are
not tied to the OS builder’s chain of trust, but to a local one
(i.e. simply backed by encrypted storage of some form).

The second approach is to make it easy to extend (or in fact replace)
the set of trusted validation keys, with local ones that are under the
control of the user, in order to make it easy to operate with kernel,
OS, extension, portable service or container images signed by the
local developer without involvement of the OS builder. This is
relatively easy to do for components further down the trust chain: the
elements further up the chain should optionally accept additional
certificates to validate against.

(Note that systemd currently has no explicit support for a
“developer” mode like this. I think we should add that sooner or later
however.)

Democratizing Code Signing

Closely related to the question of developer mode is the question of
code signing. If you ask me, the status quo of UEFI SecureBoot code
signing in the major Linux distributions is pretty sad. The work to
get stuff signed is massive, but in effect it delivers very little in
return: because initrds are entirely unprotected and reside on
partitions lacking any form of cryptographic integrity protection, any
attacker can trivially modify the boot process of any such Linux
system and freely collect the FDE passphrases entered. There’s
little value in signing the boot loader and kernel in a complex
bureaucracy if it then happily loads entirely unprotected code that
processes the actually relevant security credentials: the FDE
keys.

In my model, through use of unified kernels this important gap is
closed, hence UEFI SecureBoot code signing becomes an integral part of
the boot chain from firmware to the host OS. Unfortunately, code
signing and having something a user can locally hack are to some level
conflicting goals. However, I think we can improve the situation here,
and put more emphasis on enrolling developer keys in the trust chain
easily. Specifically, I see one relevant approach here: enrolling keys
directly in the firmware is something that we should make less of a
theoretical exercise and more something we can realistically
deploy. See this work in
progress

making this more automatic and eventually safe. Other approaches are
thinkable (including some that build on existing MokManager
infrastructure), but given the politics involved, are harder to
conclusively implement.
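To give a rough idea of the direction (treat the specifics as
assumptions; see the systemd-boot documentation of the
secure-boot-enroll loader.conf option): a developer can drop their own
Secure Boot keys onto the ESP, and a sufficiently recent systemd-boot
can offer to enroll them from its menu:

# developer keys placed on the ESP for systemd-boot to offer for enrollment
/efi/loader/keys/mylocalkeys/PK.auth
/efi/loader/keys/mylocalkeys/KEK.auth
/efi/loader/keys/mylocalkeys/db.auth

# /efi/loader/loader.conf
secure-boot-enroll manual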

Running the OS itself in a container

What I explain above is put together with running on a bare metal
system in mind. However, one of the stated goals is to make the OS
adaptive enough to also run in a container environment (specifically:
systemd-nspawn) nicely. Booting a disk image on bare metal or in a
VM generally means that the UEFI firmware validates and invokes the
boot loader, and the boot loader invokes the kernel which then
transitions into the final system. This is different for containers:
here the container manager immediately calls the init system, i.e. PID
1. Thus the validation logic must be different: cryptographic
validation must be done by the container manager. In my model this is
solved by shipping the OS image not only with a Verity data partition
(as is already necessary for the UEFI SecureBoot trust chain, see
above), but also with another partition, containing a PKCS#7 signature
of the root hash of said Verity partition. This of course is exactly
what I propose for both the system extension and portable service
image. Thus, in my model the images for all three uses are put
together the same way: an immutable /usr/ partition, accompanied by
a Verity partition and a PKCS#7 signature partition. The OS image
itself then has two ways “into” the trust chain: either through the
signed unified kernel in the ESP (which is used for bare metal and VM
boots) or by using the PKCS#7 signature stored in the partition
(which is used for container/systemd-nspawn boots).

Parameterizing Kernels

A fully immutable and signed OS has to establish trust in the user
data it makes use of before doing so. In the model I describe here,
for /etc/ and /var/ we do this via disk encryption of the root
file system (in combination with integrity checking). But the point
where the root file system is mounted comes relatively late in the
boot process, and thus cannot be used to parameterize the boot
itself. In many cases it’s important to be able to parameterize the
boot process however.

For example, for the implementation of the developer mode indicated
above it’s useful to be able to pass this fact safely to the initrd,
in combination with other fields (e.g. hashed root password for
allowing in-initrd logins for debug purposes). After all, if the
initrd is pre-built by the vendor and signed as a whole together with
the kernel, it cannot be modified to carry such data directly (which is
in fact how parameterization of the initrd was traditionally done, to
a large degree).

In my model this is achieved through system
credentials
, which allow passing
parameters to systems (and services, for that matter) in an encrypted
and authenticated fashion, bound to the TPM2 chip. This means that we
can securely pass data into the initrd so that it can be authenticated
and decrypted only on the system it is intended for and with the
unified kernel image it was intended for.
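As a small sketch of what that can look like (the credential name is
the one systemd-sysusers consumes, also used further below; how the
credential is delivered into the initrd is left out here):

# seal a root password hash as a credential, bound to the local TPM2
mkpasswd mysecret | systemd-creds encrypt --with-key=tpm2 --name=passwd.hashed-password.root - root-password.cred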

Swap

In my model the OS would also carry a swap partition, for the simple
reason that only then can
systemd-oomd.service
provide the best results. Also see In defence of swap: common
misconceptions.

Updating Images

We have a rough idea how the system shall be organized now, let’s next
focus on the deployment cycle: software needs regular update cycles,
and software that is not updated regularly is a security
problem. Thus, I am sure that any modern system must be automatically
updated, without this requiring avoidable user interaction.

In my model, this is the job for
systemd-sysupdate. It’s
a relatively simple A/B image updater: it operates either on
partitions, on regular files in a directory, or on subdirectories in a
directory. Each entry has a version (which is encoded in the GPT
partition label for partitions, and in the filename for regular files
and directories): whenever an update is initiated the oldest version
is erased, and the newest version is downloaded.

With the setup described above a system update becomes a really simple
operation. On each update the systemd-sysupdate tool downloads a
/usr/ file system partition, an accompanying Verity partition, a
PKCS#7 signature partition, and drops them into the host’s partition
table (where they possibly replace the oldest versions so far stored
there). Then it downloads a unified kernel image and drops it into
the EFI System Partition’s /EFI/Linux (as per the Boot Loader
Specification; possibly erasing the oldest such file there). And that’s
already the whole update process: four files are downloaded from the
server, unpacked and put in the most straightforward of ways into the
partition table or file system. Unlike in other OS designs there’s no
mechanism required to explicitly switch to the newer version, the
aforementioned systemd-boot logic will automatically pick the newest
kernel once it is dropped in.
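Here’s a rough sketch of what a systemd-sysupdate transfer definition
for the /usr/ partition could look like; the download URL and naming
pattern are placeholders, following the fooOS naming used elsewhere in
this post:

# /usr/lib/sysupdate.d/50-usr.conf (illustrative)
[Transfer]
ProtectVersion=%A

[Source]
Type=url-file
Path=https://download.example.com/fooOS/
MatchPattern=fooOS_@v.usr.raw

[Target]
Type=partition
Path=auto
MatchPattern=fooOS_@v
MatchPartitionType=usr
ReadOnly=yes

Similar definitions would exist for the Verity and signature
partitions, and for the unified kernel image dropped into the ESP.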

Above we talked a lot about modularity, and how to put systems
together as a combination of a host OS image, system extension images
for the initrd and the host, portable service images and
systemd-nspawn container images. I already emphasized that these
image files are actually always the same: GPT disk images with
partition definitions that match the Discoverable Partition
Specification. This comes in very handy when thinking about updating: we
can use the exact same systemd-sysupdate tool for updating these
other images as we use for the host image. The uniformity of the
on-disk format allows us to update them uniformly too.

Boot Counting + Assessment

Automatic OS updates do not come without risks: if they happen
automatically and an update goes wrong, your system might be
automatically updated into a brick. This of course is less
than ideal. Hence it is essential to address this reasonably
automatically. In my model, there’s systemd’s Automatic Boot
Assessment
for
that. The mechanism is simple: whenever a new unified kernel image is
dropped into the system it will be stored with a small integer counter
value included in the filename. Whenever the unified kernel image is
selected for booting by systemd-boot, the counter is decreased by one. Once
the system booted up successfully (which is determined by userspace)
the counter is removed from the file name (which indicates “this entry
is known to work”). If the counter ever hits zero, this indicates that
the system tried to boot this entry a couple of times and failed each
time; the entry is thus apparently “bad”. In this case systemd-boot
will not consider the kernel anymore, and revert to the next older
entry (one whose counter hasn’t hit zero).

By sticking the boot counter into the filename of the unified kernel
we can directly attach this information to the kernel, and thus need
not concern ourselves with cleaning up secondary information about the
kernel when the kernel is removed. Updating with a tool like
systemd-sysupdate hence remains a very simple operation: drop one
old file, add one new file.
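For illustration, assuming the ESP is mounted at /efi and using the
fooOS naming from below, the life of such an entry looks roughly like
this:

/efi/EFI/Linux/fooOS_0.8+3.efi      # freshly installed, 3 boot attempts left
/efi/EFI/Linux/fooOS_0.8+2-1.efi    # after one failed boot attempt
/efi/EFI/Linux/fooOS_0.8.efi        # renamed after a successful boot: known good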

Picking the Newest Version

I already mentioned that systemd-boot automatically picks the newest
unified kernel image to boot, by looking at the version encoded in the
filename. This is done via a simple
strverscmp()
call (well, truth be told, it’s a modified version of that call,
different from the one implemented in libc, because real-life package
managers use more complex rules for comparing versions these days, and
hence it made sense to do that here too). The concept of having
multiple entries of some resource in a directory, and picking the
newest one automatically is a powerful concept, I think. It means
adding/removing new versions is extremely easy (as we discussed above,
in systemd-sysupdate context), and allows stateless determination of
what to use.

If systemd-boot can do that, what about system extension images,
portable service images, or systemd-nspawn container images that do
not actually use systemd-boot as the entrypoint? All these tools
actually implement the very same logic, but on the partition level: if
multiple suitable /usr/ partitions exist, then the newest is determined
by comparing their GPT partition labels.

This is in a way the counterpart to the systemd-sysupdate update
logic described above: we always need a way to determine which
partition to actually use after the update took place, and this is
easy every time: enumerate the possible entries, and pick the
newest as per the (modified) strverscmp() result.

Home Directory Management

In my model the device’s users and their home directories are managed
by
systemd-homed. This
means they are relatively self-contained and can be migrated easily
between devices. The numeric UID assignment for each user is done at
the moment of login only, and the files in the home directory are
mapped as needed via a uidmap mount. It also allows us to protect
the data of each user individually with a credential that belongs to
the user itself, i.e. instead of binding the confidentiality of the user’s
data to the system-wide full disk encryption, each user gets their own
encrypted home directory where the user’s authentication token
(password, FIDO2 token, PKCS#11 token, recovery key…) is used as
authentication and decryption key for the user’s data. This brings
a major improvement for security as it means the user’s data is
cryptographically inaccessible except when the user is actually logged
in.
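As a quick sketch (the user name and sizes are made up), creating such
a self-contained, encrypted home area could look like this:

# create a LUKS-backed home area managed by systemd-homed
homectl create lennart --storage=luks --fs-type=btrfs --disk-size=100G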

It also allows us to correct another major issue with traditional
Linux systems: the way how data encryption works during system
suspend. Traditionally on Linux the disk encryption credentials
(e.g. LUKS passphrase) is kept in memory also when the system is
suspended. This is a bad choice for security, since many (most?) of us
probably never turn off their laptop but suspend it instead. But if
the decryption key is always present in unencrypted form during the
suspended time, then it could potentially be read from there by a
sufficiently equipped attacker.

By encrypting the user’s home directory with the user’s authentication
token we can first safely “suspend” the home directory before going to
the system suspend state (i.e. flush out the cryptographic keys needed
to access it). This means any process currently accessing the home
directory will be frozen for the time of the suspend, but that’s
expected anyway during a system suspend cycle. Why is this better than
the status quo ante? In this model the home directory’s cryptographic
key material is erased during suspend, but it can be safely reacquired
on resume, from system code. If the system is only encrypted as a
whole however, then the system code itself couldn’t reauthenticate the
user, because it would be frozen too. By separating home directory
encryption from the root file system encryption we can avoid this
problem.

Partition Setup

So we discussed the partition organization of OS images multiple
times in the above, each time focusing on a specific aspect. Let’s
now summarize what this should look like all together.

In my model, the initial, shipped OS image should look roughly like this:

  • (1) A UEFI System Partition, with systemd-boot as boot loader and one unified kernel
  • (2) A /usr/ partition (version “A”), with a label fooOS_0.7 (under the assumption we called our project fooOS and the image version is 0.7).
  • (3) A Verity partition for the /usr/ partition (version “A”), with the same label
  • (4) A partition carrying the Verity root hash for the /usr/ partition (version “A”), along with a PKCS#7 signature of it, also with the same label

On first boot this is augmented by systemd-repart like this:

  • (5) A second /usr/ partition (version “B”), initially with a label _empty (which is the label systemd-sysupdate uses to mark partitions that currently carry no valid payload)
  • (6) A Verity partition for that (version “B”), similar to the above case, also labelled _empty
  • (7) And ditto a Verity root hash partition with a PKCS#7 signature (version “B”), also labelled _empty
  • (8) A root file system, encrypted and locked to the TPM2
  • (9) A home file system, integrity protected via a key also in TPM2 (encryption is unnecessary, since systemd-homed adds that on its own, and it’s nice to avoid duplicate encryption)
  • (10) A swap partition, encrypted and locked to the TPM2

Then, on the first OS update the partitions 5, 6, 7 are filled with a
new version of the OS (let’s say 0.8) and thus get their label
updated to fooOS_0.8. After a boot, this version is active.

On a subsequent update the three partitions fooOS_0.7 get wiped and
replaced by fooOS_0.9 and so on.

On factory reset, the partitions 8, 9, 10 are deleted, so that
systemd-repart recreates them, using a new set of cryptographic
keys.

Here’s a graphic that hopefully illustrates the partition table from
the shipped image, through first boot, multiple update cycles and
eventual factory reset:

Partitions Overview

Trust Chain

So let’s summarize the intended chain of trust (for bare metal/VM
boots) that ensures every piece of code in this model is signed
and validated, and any system secret is locked to TPM2.

  1. First, firmware (or possibly shim) authenticates systemd-boot.

  2. Once systemd-boot picks a unified kernel image to boot, it is
    also authenticated by firmware/shim.

  3. The unified kernel image contains an initrd, which is the first
    userspace component that runs. It finds any system extensions passed
    into the initrd, and sets them up through Verity. The kernel will
    validate the Verity root hash signature of these system extension
    images against its usual keyring.

  4. The initrd also finds credentials passed in, then securely unlocks
    (which means: decrypts + authenticates) them with a secret from the
    TPM2 chip, locked to the kernel image itself.

  5. The kernel image also contains a kernel command line which contains
    a usrhash= option that pins the root hash of the /usr/ partition
    to use.

  6. The initrd then unlocks the encrypted root file system, with a
    secret bound to the TPM2 chip.

  7. The system then transitions into the main system, i.e. the
    combination of the Verity protected /usr/ and the encrypted root
    file system. It then activates two more encrypted (and/or
    integrity protected) volumes for /home/ and swap, also with a
    secret tied to the TPM2 chip.

Here’s an attempt to illustrate the above graphically:

Trust Chain

This is the trust chain of the basic OS. Validation of system
extension images, portable service images, systemd-nspawn container
images always takes place the same way: the kernel validates these
Verity images along with their PKCS#7 signatures against the kernel’s
keyring.

File System Choice

In the above I left the choice of file systems unspecified. For the
immutable /usr/ partitions squashfs might be a good candidate, but
any other that works nicely in a read-only fashion and generates
reproducible results is a good choice, too. The home directories as managed
by systemd-homed should certainly use btrfs, because it’s the only
general purpose file system supporting online grow and shrink, which
systemd-homed can take advantage of to manage storage.

For the root file system btrfs is likely also the best idea. That’s
because we intend to use LUKS/dm-crypt underneath, which by default
only provides confidentiality, not authenticity of the data (unless
combined with dm-integrity). Since btrfs (unlike xfs/ext4) does
full data checksumming it’s probably the best choice here, since it
means we don’t have to use dm-integrity (which comes at a higher
performance cost).

OS Installation vs. OS Instantiation

In the discussion above a lot of focus was put on setting up the OS
and completing the partition layout and such on first boot. This means
installing the OS becomes as simple as dd-ing (i.e. “streaming”) the
shipped disk image into the final HDD medium. Simple, isn’t it?

Of course, such a scheme is just too simple for many setups in real
life. Whenever multi-boot is required (i.e. co-installing an OS
implementing this model with another unrelated one), dd-ing a disk
image onto the HDD is going to overwrite user data that was supposed
to be kept around.

In order to cover for this case, in my model, we’d use
systemd-repart (again!) to allow streaming the source disk image
into the target HDD in a smarter, additive way. The tool after all is
purely additive: it will add in partitions or grow them if they are
missing or too small. systemd-repart already has all the necessary
provisions to not only create a partition on the target disk, but also
copy blocks from a raw installer disk. An install operation would then
become a two step process: one invocation of systemd-repart that
adds in the /usr/, its Verity and the signature partition to the
target medium, populated with a copy of the same partition of the
installer medium. And one invocation of bootctl that installs the
systemd-boot boot loader in the ESP. (Well, there’s one thing
missing here: the unified OS kernel also needs to be dropped into the
ESP. For now, this can be done with a simple cp call. In the long
run, this should probably be something bootctl can do as well, if
told so.)
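A minimal sketch of what such an additive install could look like (the
device, paths and partition definitions are illustrative; the relevant
repart.d setting is CopyBlocks=):

# /usr/lib/repart.d-install/20-usr.conf: replicate the installer's own /usr/ partition onto the target
# (similar definitions would cover the usr-verity and usr-verity-sig partitions)
[Partition]
Type=usr
CopyBlocks=auto

# then, running from the installer medium:
systemd-repart --definitions=/usr/lib/repart.d-install --dry-run=no /dev/sda
bootctl install --esp-path=/mnt/target-esp
cp /efi/EFI/Linux/fooOS_0.7.efi /mnt/target-esp/EFI/Linux/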

So, with this scheme we have a simple scheme to cover all bases: we
can either just dd an image to disk, or we can stream an image onto
an existing HDD, adding a couple of new partitions and files to the
ESP.

Of course, in reality things are more complex than that even: there’s
a good chance that the existing ESP is simply too small to carry
multiple unified kernels. In my model, the way to address this is by
shipping two slightly different systemd-repart partition definition
file sets: the ideal case when the ESP is large enough, and a
fallback case, where it isn’t and where we then add in an additional
XBOOTLDR partition (as per the Discoverable Partitions
Specification). In that mode the ESP carries the boot loader, but the
unified kernels are stored in the XBOOTLDR partition. This scenario is
not quite as simple as the XBOOTLDR-less scenario described first, but
is equally well supported in the various tools. Note that
systemd-repart can be told size constraints on the partitions it
shall create or augment, thus to implement this scheme it’s enough to
invoke the tool with the fallback partition scheme if invocation with
the ideal scheme fails.

Either way: regardless how the partitions, the boot loader and the
unified kernels ended up on the system’s hard disk, on first boot the
code paths are the same again: systemd-repart will be called to
augment the partition table with the root file system, and properly
encrypt it, as was already discussed earlier here. This means: all
cryptographic key material used for disk encryption is generated on
first boot only, the installer phase does not encrypt anything.

Live Systems vs. Installer Systems vs. Installed Systems

Traditionally on Linux three types of systems were common: “installed”
systems, i.e. that are stored on the main storage of the device and
are the primary place people spend their time in; “installer” systems
which are used to install them and whose job is to copy and set up the
packages that make up the installed system; and “live” systems, which
were a middle ground: a system that behaves like an installed system
in most ways, but lives on removable media.

In my model I’d like to remove the distinction between these three
concepts as much as possible: each of these three images should carry
the exact same /usr/ file system, and should be suitable to be
replicated the same way. Once installed the resulting image can also
act as an installer for another system, and so on, creating a certain
“viral” effect: if you have one image or installation it’s
automatically something you can replicate 1:1 with a simple
systemd-repart invocation.

Building Images According to this Model

The above explains what the image should look like and how its first
boot and update cycle will modify it. But this leaves one question
unanswered: how to actually build the initial image for OS instances
according to this model?

Note that there’s nothing too special about the images following this
model: they are ultimately just GPT disk images with Linux file
systems, following the Discoverable Partition Specification. This
means you can use any set of tools of your choice that can put
together compliant GPT disk images.

I personally would use mkosi for
this purpose though. It’s designed to generate compliant images, and
has a rich toolset for SecureBoot and signed/Verity file systems
already in place.

What is key here is that this model doesn’t depart from RPM and dpkg,
instead it builds on top of that: in this model they are excellent for
putting together images on the build host, but deployment onto the
runtime host does not involve individual packages.

I think one should not underestimate the value traditional distributions
bring regarding security, integration and general polishing. The
concepts I describe above are inherited from this, but depart from the
idea that distribution packages are a runtime concept and make it a
build-time concept instead.

Note that the above is pretty much independent from the underlying
distribution.

Final Words

I have no illusions, general purpose distributions are not going to
adopt this model as their default any time soon, and it’s not even my
goal that they do that. The above is my personal vision, and I
don’t expect people to buy into it 100%, and that’s fine. However,
what I am interested in is finding the overlaps, i.e. work with people
who buy 50% into this vision, and share the components.

My goals here thus are to:

  1. Get distributions to move to a model where images like this can be
    built from the distribution easily. Specifically this means that
    distributions make their OS hermetic in /usr/.

  2. Find the overlaps, share components with other projects to revisit
    how distributions are put together. This is already happening, see
    systemd-tmpfiles and systemd-sysuser support in various
    distributions, but I think there’s more to share.

  3. Make people interested in building actual real-world images based
    on general purpose distributions adhering to the model described
    above. I’d love a “GnomeBook” image with full trust properties,
    that is built from true Linux distros, such as Fedora or
    ArchLinux.

FAQ

  1. What about ostree? Doesn’t ostree already deliver what this blog story describes?

    ostree is fine technology, but in respect to security and
    robustness properties it’s not too interesting I think, because
    unlike image-based approaches it cannot really deliver
    integrity/robustness guarantees easily. To be able to trust an
    ostree setup you have to establish trust into the underlying
    file system first, and the complexity of the file system makes
    that challenging. To provide an effective offline-secure trust
    chain through the whole depth of the stack it is essential to
    cryptographically validate every single I/O operation. In an
    image-based model this is trivially easy, but in the ostree model
    it’s not possible with current file system technology, and even if
    this is added in one way or another in the future (though I am not
    aware of anyone doing file-based integrity in a way that is compatible
    with ostree‘s hardlink farm model) I think validation would still
    happen at too high a level, since Linux file system developers have
    made very clear that their implementations are not robust to rogue images.

    With my design I want to deliver similar security guarantees as
    ChromeOS does, but ostree is much weaker there, and I see no
    perspective of this changing. In a way ostree‘s integrity checks
    are similar to RPM’s and enforced on download rather than on
    access. In the model I suggest above, it’s always on access, and
    thus safe towards offline attacks (i.e. evil maid attacks). In
    today’s world, I think offline security is absolutely necessary
    though.

    That said, ostree does have some benefits over the model
    described above: it naturally shares file system inodes if many of
    the modules/images involved share the same data. It’s thus more
    space efficient on disk (and thus also in RAM/cache to some
    degree) by default. In my model it would be up to the image
    builders to minimize shipping overly redundant disk images, by
    making good use of suitably composable system extensions.

  2. What about configuration management?

    At first glance immutable systems and configuration management
    don’t go that well together. However, do note that in the model
    I propose above the root file system with all its contents,
    including /etc/ and /var/ is actually writable and can be
    modified like on any other typical Linux distribution. The only
    exception is /usr/ where the immutable OS is hermetic. That
    means configuration management tools should work just fine in this
    model – up to the point where they are used to install additional
    RPM/dpkg packages, because that’s something not allowed in the
    model above: packages need to be installed at image build time and
    thus on the image build host, not the runtime host.

  3. What about non-UEFI and non-TPM2 systems?

    The above is designed around the feature set of contemporary PCs,
    and this means UEFI and TPM2 being available (simply because the
    PC is pretty much defined by the Windows platform, and current
    versions of Windows require both).

    I think it’s important to make the best of the features of today’s
    PC hardware, and then find suitable fallbacks on more limited
    hardware. Specifically this means: if there’s desire to implement
    something like this on non-UEFI or non-TPM2 hardware we should
    look for suitable fallbacks for the individual functionality, but
    generally try to add glue to the old systems so that conceptually
    they behave more like the new systems instead of the other way
    round. Or in other words: most of the above is not strictly tied
    to UEFI or TPM2, and in many cases there are already reasonable
    fallbacks in place for more limited systems. Of course, without
    TPM2 many of the security guarantees will be weakened.

  4. How would you name an OS built that way?

    I think a desktop OS built this way if it has the GNOME desktop
    should of course be called GnomeBook, to mimic the ChromeBook
    name. 😉

    But in general, I’d call hermetic, adaptive, immutable OSes like this “particles“.

How can you help?

  1. Help making Distributions Hermetic in /usr/!

    One of the core ideas of the approach described above is to make
    the OS hermetic in /usr/, i.e. make it carry a comprehensive
    description of what needs to be set up outside of it when
    instantiated. Specifically, this means that system users that are
    needed are declared in systemd-sysusers snippets, and skeleton
    files and directories are created via systemd-tmpfiles. Moreover
    additional partitions should be declared via systemd-repart
    drop-ins.

    At this point some distributions (such as Fedora) are (probably
    more by accident than on purpose) already mostly hermetic in
    /usr/, at least for the most basic parts of the OS. However,
    this is not complete: many daemons require specific resources to be
    set up in /var/ or /etc/ before they can work, and
    the relevant packages do not carry systemd-tmpfiles descriptions
    that add them if missing. So there are two ways you could help
    here: politically, it would be highly relevant to convince
    distributions that an OS that is hermetic in /usr/ is highly
    desirable and it’s a worthy goal for packagers to get there. More
    specifically, it would be desirable if RPM/dpkg packages would
    ship with enough systemd-tmpfiles information so that
    configuration files the packages strictly need for operation are
    symlinked (or copied) from /usr/share/factory/ if they are
    missing (even better of course would be if packages from their
    upstream sources would just work with an empty /etc/ and
    /var/, create what they need themselves, and fall back to good
    defaults in the absence of configuration files). A small
    sysusers.d/tmpfiles.d sketch follows after this list.

    Note that distributions that adopted systemd-sysusers,
    systemd-tmpfiles and the /usr/ merge are already quite close
    to providing an OS that is hermetic in /usr/. These were the
    big, the major advancements: making the image fully hermetic
    should be less controversial – at least that’s my guess.

    Also note that making the OS hermetic in /usr/ is not just useful in
    scenarios like the above. It also means that stuff like
    this

    and like
    this

    can work well.

  2. Fill in the gaps!

    I already mentioned a couple of missing bits and pieces in the
    implementation of the overall vision. In the systemd project
    we’d be delighted to review/merge any PRs that fill in the voids.

  3. Build your own OS like this!

    Of course, while we built all these building blocks and they have
    been adopted to various levels and various purposes in the various
    distributions, no one so far built an OS that puts things together
    just like that. It would be excellent if we had communities that
    work on building images like what I propose above, i.e. if you
    want to work on making a secure GnomeBook as I suggest above a
    reality, that would be more than welcome.

    What could this look like specifically? Pick an existing
    distribution, write a set of mkosi descriptions plus some
    additional drop-in files, and then build this on some build
    infrastructure. While doing so, report the gaps, and help us
    address them.
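And here’s the small sketch referenced in the first item above: the
kind of declarative snippets that help make a package hermetic in
/usr/ (names are made up; see the sysusers.d and tmpfiles.d
documentation):

# /usr/lib/sysusers.d/foobar.conf: declare the system user the daemon needs
u foobard - "Foobar Daemon" /var/lib/foobar

# /usr/lib/tmpfiles.d/foobar.conf: create the daemon's state directory if missing
d /var/lib/foobar 0750 foobard foobard -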

Further Documentation of Used Components and Concepts

  1. systemd-tmpfiles
  2. systemd-sysusers
  3. systemd-boot
  4. systemd-stub
  5. systemd-sysext
  6. systemd-portabled, Portable Services Introduction
  7. systemd-repart
  8. systemd-nspawn
  9. systemd-sysupdate
  10. systemd-creds, System and Service Credentials
  11. systemd-homed
  12. Automatic Boot Assessment
  13. Boot Loader Specification
  14. Discoverable Partitions Specification
  15. Safely Building Images

Earlier Blog Stories Related to this Topic

  1. The Strange State of Authenticated Boot and Disk Encryption on Generic Linux Distributions
  2. The Wondrous World of Discoverable GPT Disk Images
  3. Unlocking LUKS2 volumes with TPM2, FIDO2, PKCS#11 Security Hardware on systemd 248
  4. Portable Services with systemd v239
  5. mkosi — A Tool for Generating OS Images

And that’s all for now.

Testing my System Code in /usr/ Without Modifying /usr/

Post Syndicated from original https://0pointer.net/blog/testing-my-system-code-in-usr-without-modifying-usr.html

I recently
blogged

about how to run a volatile systemd-nspawn container from your
host’s /usr/ tree, for quickly testing stuff in your host
environment, sharing your home directory, but all that without making a
single modification to your host, and on an isolated node.

The one-liner discussed in that blog story is great for testing during
system software development. Let’s have a look at another systemd
tool that I regularly use to test things during systemd development,
in a relatively safe environment, but still taking full benefit of my
host’s setup.

For a while now, systemd has been shipping with a simple component
called
systemd-sysext. Its
primary use case goes something like this: on one hand OSes with
immutable /usr/ hierarchies are fantastic for security, robustness,
updating and simplicity, but on the other hand not being able to
quickly add stuff to /usr/ is just annoying.

systemd-sysext is supposed to bridge this contradiction: when
invoked it will merge a bunch of “system extension” images into
/usr/ (and /opt/ as a matter of fact) through the use of read-only
overlayfs, making all files shipped in the image instantly and
atomically appear in /usr/ during runtime — as if they always had
been there. Now, let’s say you are building your locked down OS, with
an immutable /usr/ tree, and it comes without the ability to log in,
without debugging tools, without anything you want and need when
trying to debug and fix something in the system. With systemd-sysext
you could use a system extension image that contains all this, drop it
into the system, and activate it with systemd-sysext so that it
genuinely extends the host system.

(There are many other usecases for this tool, for example, you could
build systems that way that at their base use a generic image, but by
installing one or more system extensions get extended to with
additional more specific functionality, or drivers, or similar. The
tool is generic, use it for whatever you want, but for now let’s not
get lost in listing all the possibilities.)

What’s particularly nice about the tool is that it supports
automatically discovered dm-verity images, with signatures and
everything. So you can even do this in a fully authenticated,
measured, safe way. But I am digressing…

Now that we (hopefully) have a rough understanding of what
systemd-sysext is and does, let’s discuss how specifically we can
use this in the context of system software development, to safely use
and test bleeding edge development code (built freshly from your
project’s build tree) in your host OS without having to risk that the
host OS is corrupted or becomes unbootable by stuff that didn’t quite
yet work the way it was envisioned:

The images systemd-sysext merges into /usr/ can be of two kinds:
disk images with a file system/verity/signature, or simple, plain
directory trees. To make these images available to the tool, they can
be placed or symlinked into /usr/lib/extensions/,
/var/lib/extensions/, /run/extensions/ (and a bunch of
others). So if we now install our freshly built development software
into a subdirectory of those paths, then that’s entirely sufficient to
make them valid system extension images in the sense of
systemd-sysext, and thus can be merged into /usr/ to try them out.

To be more specific: when I develop systemd itself, here’s what I do
regularly, to see how my new development version would behave on my
host system. As preparation I checked out the systemd development git
tree first of course, hacked around in it a bit, then built it with
meson/ninja. And now I want to test what I just built:

sudo DESTDIR=/run/extensions/systemd-test ninja -C build install &&
        sudo systemd-sysext refresh --force

Explanation: first, we’ll install my current build tree as a system
extension into /run/extensions/systemd-test/. And then we apply it
to the host via the systemd-sysext refresh command. This command
will search for all installed system extension images in the
aforementioned directories, then unmount (i.e. “unmerge”) any
previously merged dirs from /usr/ and then freshly mount
(i.e. “merge”) the new set of system extensions on top of /usr/. And
just like that, I have installed my development tree of systemd into
the host OS, and all that without actually modifying/replacing even a
single file on the host at all. Nothing here actually hit the disk!

Note that all this works on any system really, it is not necessary
that the underlying OS even is designed with immutability in
mind. Just because the tool was developed with immutable systems in
mind it doesn’t mean you couldn’t use it on traditional systems where
/usr/ is mutable as well. In fact, my development box actually runs
regular Fedora, i.e. is RPM-based and thus has a mutable /usr/
tree. As long as system extensions are applied the whole of /usr/
becomes read-only though.

Once I am done testing, when I want to revert to how things were without the image installed, it is sufficient to call:

sudo systemd-sysext unmerge

And there you go, all files my development tree generated are gone
again, and the host system is as it was before (and /usr/ mutable
again, in case one is on a traditional Linux distribution).

Also note that a reboot (regardless if a clean one or an abnormal
shutdown) will undo the whole thing automatically, since we installed
our build tree into /run/ after all, i.e. a tmpfs instance that is
flushed on boot. And given that the overlayfs merge is a runtime
thing, too, the whole operation was executed without any
persistence. Isn’t that great?

(You might wonder why I specified --force on the systemd-sysext
refresh
line earlier. That’s because systemd-sysext actually does
some minimal version compatibility checks when applying system
extension images. For that it will compare the host’s
/etc/os-release file with the image’s
/usr/lib/extension-release.d/extension-release.<name>, and refuse
operation if the image is not actually built for the host OS
version. Here we don’t want to bother with dropping that file in
there, we know already that the extension image is compatible with
the host, as we just built it on it. --force allows us to skip the
version check.)

You might wonder: what about the combination of the idea from the
previous blog story (regarding running container’s off the host
/usr/ tree) with system extensions? Glad you asked. Right now we
have no support for this, but it’s high on our TODO list (patches
welcome, of course!). i.e. a new switch for systemd-nspawn called
--system-extension= that would allow merging one or more such
extensions into the container tree booted would be stellar. With that,
with a single command I could run a container off my host OS but with
a development version of systemd dropped in, all without any
persistence. How awesome would that be?

(Oh, and in case you wonder, all of this only works with distributions
that have completed the /usr/ merge. On legacy distributions that
didn’t do that and still place parts of /usr/ all over the hierarchy
the above won’t work, since merging /usr/ trees via overlayfs is
pretty pointless if the OS is not hermetic in /usr/.)

And that’s all for now. Happy hacking!

Running a Container off the Host /usr/

Post Syndicated from original https://0pointer.net/blog/running-an-container-off-the-host-usr.html

Apparently, in some parts of this
world
, the /usr/-merge
transition is still ongoing. Let’s take the opportunity to have a look
at one specific way to take benefit of the /usr/-merge (and
associated work) IRL.

I develop system-level software as you might know. Oftentimes I want
to run my development code on my PC but be reasonably sure it cannot
destroy or otherwise negatively affect my host system. Now I could set
up a container tree for that, and boot into that. But often I am too
lazy for that: I don’t want to bother with a slow package manager
setting up a new OS tree for me. So here’s what I often do instead —
and this only works because of the /usr/-merge.

I run a command like the following (without any preparatory work):

systemd-nspawn \
        --directory=/ \
        --volatile=yes \
        -U \
        --set-credential=passwd.hashed-password.root:$(mkpasswd mysecret) \
        --set-credential=firstboot.locale:C.UTF-8 \
        --bind-user=lennart \
        -b

And then I very quickly get a login prompt on a container that runs
the exact same software as my host — but is also isolated from the
host. I do not need to prepare any separate OS tree or anything
else. It just works. And my host user lennart is just there,
ready for me to log into.

So here’s what these
systemd-nspawn
options specifically do:

  • --directory=/ tells systemd-nspawn to run off the host OS’
    file hierarchy. That smells like danger of course, running two
    OS instances off the same directory hierarchy. But don’t be
    scared, because:

  • --volatile=yes enables volatile mode. Specifically this means
    what we configured with --directory=/ as root file system is
    slightly rearranged. Instead of mounting that tree as it is, we’ll
    mount a tmpfs instance as actual root file system, and then
    mount the /usr/ subdirectory of the specified hierarchy into the
    /usr/ subdirectory of the container file hierarchy in read-only
    fashion – and only that directory. So now we have a container
    directory tree that is basically empty, but imports all host OS
    binaries and libraries into its /usr/ tree. All software
    installed on the host is also available in the container with no
    manual work. This mechanism only works because on /usr/-merged
    OSes vendor resources are monopolized at a single place:
    /usr/. It’s sufficient to share that one directory with the
    container to get a second instance of the host OS running. Note
    that this means /etc/ and /var/ will be entirely empty
    initially when this second system boots up. Thankfully, forward
    looking distributions (such as Fedora) have adopted
    systemd-tmpfiles
    and
    systemd-sysusers
    quite pervasively, so that system users and files/directories
    required for operation are created automatically should they be
    missing. Thus, even though at boot the mentioned directories are
    initially empty, once the system is booted up they are
    sufficiently populated for things to just work.

  • -U means we’ll enable user namespacing, in fully automatic
    mode. This does three things: it picks a free host UID range
    dynamically for the container, then sets up user namespacing for
    the container processes mapping host UID range to UIDs 0…65534 in
    the container. It then sets up a similar UID mapped mount on the
    /usr/ tree of the container. Net effect: file ownerships as set
    on the host OS tree appear as they belong to the very same users
    inside of the container environment, except that we use user
    namespacing for everything, and thus the users are actually
    neatly isolated from the host.

  • --set-credential=passwd.hashed-password.root:$(mkpasswd
    mysecret)
    passes a credential to the container. Credentials are
    bits of data that you can pass to systemd services and whole
    systems. They are actually awesome concepts (e.g. they support
    TPM2 authentication/encryption that just works!) but I am not going
    to go into details around that, given it’s off-topic in this
    specific scenario. Here we just take benefit of the fact that
    systemd-sysusers looks for a credential called
    passwd.hashed-password.root to initialize the root password of
    the system from. We set it to mysecret. This means once the
    system is booted up we can log in as root and the supplied
    password. Yay. (Remember, /etc/ is initially empty on this
    container, and thus also carries no /etc/passwd or
    /etc/shadow, and thus has no root user record, and thus no root
    password.)

    mkpasswd is a tool that
    converts a plain text password into a UNIX hashed password, which
    is what this specific credential expects.

  • Similar, --set-credential=firstboot.locale:C.UTF-8 tells the
    systemd-firstboot
    service in the container to initialize /etc/locale.conf with
    this locale.

  • --bind-user=lennart binds the host user lennart into the
    container, also as user lennart. This does two things: it mounts
    the host user’s home directory into the container. It also copies
    a minimal user record of the specified user into the container
    that
    nss-systemd
    then picks up and includes in the regular user database. This
    means, once the container is booted up I can log in as lennart
    with my regular password, and once I logged in I will see my
    regular host home directory, and can make changes to it. Yippieh!
    (This does a couple of more things, such as UID mapping and
    things, but let’s not get lost in too much details.)

So, if I run this, I will very quickly get a login prompt, where I can
log into as my regular user. I have full access to my host home
directory, but otherwise everything is nicely isolated from the host,
and changes outside of the home directory are either prohibited or are
volatile, i.e. go to a tmpfs instance whose lifetime is bound to the
container’s lifetime: when I shut down the container I just started,
then any changes outside of my user’s home directory are lost.

Note that while here I use --volatile=yes in combination with
--directory=/, you can actually use it on any OS hierarchy, i.e. just
about any directory that contains OS binaries.

Similarly, the --bind-user= stuff works with any OS hierarchy too (but
do note that only systemd 249 and newer will pick up the user records
passed to the container that way, i.e. this requires at least v249
both on the host and in the container to work).

Or in short: the possibilities are endless!

Requirements

For this all to work, you need:

  1. A recent kernel (5.15 should suffice, as it brings UID mapped
    mounts for the most common file systems, so that -U and
    --bind-user= can work well.)

  2. A recent systemd (249 should suffice, which brings --bind-user=,
    and a -U switch backed by UID mapped mounts).

  3. A distribution that adopted systemd-tmpfiles and
    systemd-sysusers so that the directory hierarchy and user
    databases are automatically populated when empty at boot. (Fedora
    35 should suffice.)

Limitations

While a lot of today's software actually works well out of the box on
systems that come up with an unpopulated /etc/ and /var/, and
either falls back to reasonable built-in defaults, or deploys
systemd-tmpfiles to create what is missing, things aren't perfect:
some software typically installed on desktop OSes will fail to start
when invoked in such a container, and will be visible as ugly failed
services, but that won't stop me from logging in and using the system
for what I want to use it for. It would be excellent to get that
fixed, though. This can either be fixed in the relevant software
upstream (i.e. if opening your configuration file fails with ENOENT,
then just fall back to reasonable defaults), or in the distribution packaging
(i.e. add a
tmpfiles.d/
file that copies or symlinks in skeleton configuration from
/usr/share/factory/etc/ via the C or L line types).
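
For illustration, a hypothetical tmpfiles.d drop-in doing that (the
file and configuration names are made up) could look like this:

# /usr/lib/tmpfiles.d/example.conf (hypothetical)
# Copy skeleton configuration from the factory tree if it is missing:
C /etc/example.conf - - - - /usr/share/factory/etc/example.conf
# Or create a symlink instead:
L /etc/other.conf - - - - /usr/share/factory/etc/other.conf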

And then there's certain software dealing with hardware management and
similar that simply cannot reasonably work in a container (as device
APIs on Linux are generally not virtualized for containers). It
would be excellent if software like that were updated to carry
ConditionVirtualization=!container or
ConditionPathIsReadWrite=/sys conditionalization in their unit
files, so that it is automatically – cleanly – skipped when executed
in such a container environment.
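
For example, a drop-in like the following (the service name is a
placeholder) would make such a service skip itself cleanly inside
containers:

# /etc/systemd/system/hypothetical-hw-daemon.service.d/container.conf
[Unit]
ConditionVirtualization=!container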

And that’s all for now.

Authenticated Boot and Disk Encryption on Linux

Post Syndicated from original http://0pointer.net/blog/authenticated-boot-and-disk-encryption-on-linux.html

The Strange State of Authenticated Boot and Disk Encryption on Generic Linux Distributions

TL;DR: Linux has been supporting Full Disk Encryption (FDE) and
technologies such as UEFI SecureBoot and TPMs for a long
time. However, the way they are set up by most distributions is not as
secure as they should be, and in some ways quite frankly weird. In
fact, right now, your data is probably more secure if stored on
current ChromeOS, Android, Windows or MacOS devices, than it is on
typical Linux distributions.

Generic Linux distributions (i.e. Debian, Fedora, Ubuntu, …) adopted
Full Disk Encryption (FDE) more than 15 years ago, with the
LUKS/cryptsetup infrastructure. It was a big step forward to a more
secure environment. Almost ten years ago the big distributions started
adding UEFI SecureBoot to their boot process. Support for Trusted
Platform Modules (TPMs) was added to the distributions a long
time ago as well — but even though many PCs/laptops these days have
TPM chips on-board, they are generally not used in the default setup of
generic Linux distributions.

How these technologies currently fit together on generic Linux
distributions doesn’t really make too much sense to me — and falls
short of what they could actually deliver. In this story I’d like to
have a closer look at why I think that, and what I propose to do about
it.

The Basic Technologies

Let’s have a closer look what these technologies actually deliver:

  1. LUKS/dm-crypt/cryptsetup provide disk encryption, and optionally
    data authentication. Disk encryption means that reading the data in
    clear-text form is only possible if you possess a secret of some
    form, usually a password/passphrase. Data authentication means that
    no one can make changes to the data on disk unless they possess a
    secret of some form. Most distributions only enable the former
    though — the latter is a more recent addition to LUKS/cryptsetup,
    and is not used by default on most distributions (though it
    probably should be). Closely related to LUKS/dm-crypt is
    dm-verity (which can authenticate immutable volumes) and
    dm-integrity (which can authenticate writable volumes, among
    other things).

  2. UEFI SecureBoot provides mechanisms for authenticating boot loaders
    and other pre-OS binaries before they are invoked. If those boot
    loaders then authenticate the next step of booting in a similar
    fashion there’s a chain of trust which can ensure that only code
    that has some level of trust associated with it will run on the
    system. Authentication of boot loaders is done via cryptographic
    signatures: the OS/boot loader vendors cryptographically sign their
    boot loader binaries. The cryptographic certificates that may be
    used to validate these signatures are then signed by Microsoft, and
    since Microsoft’s certificates are basically built into all of
    today’s PCs and laptops this will provide some basic trust chain:
    if you want to modify the boot loader of a system you must have
    access to the private key used to sign the code (or to the private
    keys further up the certificate chain).

  3. TPMs do many things. For this text we'll focus on one facet: they can
    be used to protect secrets (for example for use in disk encryption,
    see above), that are released only if the code that booted the host
    can be authenticated in some form. This works roughly like this:
    every component that is used during the boot process (i.e. code,
    certificates, configuration, …) is hashed with a cryptographic hash
    function before it is used. The resulting hash is written to some
    small volatile memory the TPM maintains that is write-only (the so
    called Platform Configuration Registers, “PCRs”): each step of the
    boot process will write hashes of the resources needed by the next
    part of the boot process into these PCRs. The PCRs cannot be
    written freely: the hashes written are combined with what is
    already stored in the PCRs — also through hashing and the result of
    that then replaces the previous value. Effectively this means: only
    if every component involved in the boot matches expectations the
    hash values exposed in the TPM PCRs match the expected values
    too. And if you then use those values to unlock the secrets you
    want to protect you can guarantee that the key is only released to
    the OS if the expected OS and configuration is booted. The process
    of hashing the components of the boot process and writing that to
    the TPM PCRs is called “measuring”. What’s also important to
    mention is that the secrets are not only protected by these PCR
    values but encrypted with a “seed key” that is generated on the TPM
    chip itself, and cannot leave the TPM (at least so goes the
    theory). The idea is that you cannot read out a TPM’s seed key, and
    thus you cannot duplicate the chip: unless you possess the
    original, physical chip you cannot retrieve the secret it might be
    able to unlock for you. Finally, TPMs can enforce a limit on unlock
    attempts per time (“anti-hammering”): this makes it hard to brute
    force things: if you can only execute a certain number of unlock
    attempts within some specific time then brute forcing will be
    prohibitively slow.
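
To make the "extend" operation a bit more tangible: a PCR is never
set to a value directly, it can only be extended, i.e. the new value
is the hash of the old value concatenated with the new
measurement. If the tpm2-tools package happens to be installed you
can inspect (and extend) PCRs yourself; a rough sketch, with the PCR
index and digest as placeholders:

$ tpm2_pcrread sha256:7
$ tpm2_pcrextend 7:sha256=<64-hex-digit-digest>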

How Linux Distributions use these Technologies

As mentioned already, Linux distributions adopted the first two
of these technologies widely, the third one not so much.

So typically, here’s how the boot process of Linux distributions works
these days:

  1. The UEFI firmware invokes a piece of code called “shim” (which is
    stored in the EFI System Partition — the “ESP” — of your system),
    that more or less is just a list of certificates compiled into code
    form. The shim is signed with the aforementioned Microsoft key,
    that is built into all PCs/laptops. This list of certificates then
    can be used to validate the next step of the boot process. The shim
    is measured by the firmware into the TPM. (Well, the shim can do a
    bit more than what I describe here, but this is outside of the
    focus of this article.)

  2. The shim then invokes a boot loader (often Grub) that is signed by
    a private key owned by the distribution vendor. The boot loader is
    stored in the ESP as well, plus some other places (i.e. possibly a
    separate boot partition). The corresponding certificate is included
    in the list of certificates built into the shim. The boot loader
    components are also measured into the TPM.

  3. The boot loader then invokes the kernel and passes it an initial
    RAM disk image (initrd), which contains initial userspace code. The
    kernel itself is signed by the distribution vendor too. It’s also
    validated via the shim. The initrd is not validated, though
    (!). The kernel is measured into the TPM, the initrd sometimes too.

  4. The kernel unpacks the initrd image, and invokes what is contained
    in it. Typically, the initrd then asks the user for a password for
    the encrypted root file system. The initrd then uses that to set up
    the encrypted volume. No code authentication or TPM measurements
    take place.

  5. The initrd then transitions into the root file system. No code
    authentication or TPM measurements take place.

  6. When the OS itself is up the user is prompted for their user name,
    and their password. If correct, this will unlock the user account:
    the system is now ready to use. At this point no code
    authentication, no TPM measurements take place. Moreover, the
    user’s password is not used to unlock any data, it’s used only to
    allow or deny the login attempt — the user’s data has already been
    decrypted a long time ago, by the initrd, as mentioned above.

What you’ll notice here of course is that code validation happens for
the shim, the boot loader and the kernel, but not for the initrd or
the main OS code anymore. TPM measurements might go one step further:
the initrd is measured sometimes too, if you are lucky. Moreover, you
might notice that the disk encryption password and the user password
are inquired by code that is not validated, and is thus not safe from
external manipulation. You might also notice that even though TPM
measurements of boot loader/OS components are done, nothing actually
ever makes use of the resulting PCRs in the typical setup.

Attack Scenarios

Of course, before determining whether the setup described above makes
sense or not, one should have an idea what one actually intends to
protect against.

The most basic attack scenario to focus on is probably that you want
to be reasonably sure that if someone steals your laptop that contains
all your data then this data remains confidential. The model described
above probably delivers that to some degree: the full disk encryption
when used with a reasonably strong password should make it hard for
the laptop thief to access the data. The data is as secure as the
password used is strong. The attacker might attempt to brute force the
password, thus if the password is not chosen carefully the attacker
might be successful.

Two more interesting attack scenarios go something like this:

  1. Instead of stealing your laptop the attacker takes the harddisk
    from your laptop while you aren’t watching (e.g. while you went for
    a walk and left it at home or in your hotel room), makes a copy of
    it, and then puts it back. You’ll never notice they did that. The
    attacker then analyzes the data in their lab, maybe trying to brute
    force the password. In this scenario you won’t even know that your
    data is at risk, because for you nothing changed — unlike in the
    basic scenario above. If the attacker manages to break your
    password they have full access to the data included on it,
    i.e. everything you so far stored on it, but not necessarily to
    what you are going to store on it later. This scenario is worse
    than the basic one mentioned above, for the simple fact that you
    won’t know that you might be attacked. (This scenario could be
    extended further: maybe the attacker has a chance to watch you type
    in your password or so, effectively lowering the password
    strength.)

  2. Instead of stealing your laptop the attacker takes the harddisk
    from your laptop while you aren’t watching, inserts backdoor code
    on it, and puts it back. In this scenario you won’t know your data
    is at risk, because physically everything is as before. What’s
    really bad though is that the attacker gets access to anything you
    do on your laptop, both the data already on it, and whatever you
    will do in the future.

I think in particular this backdoor attack scenario is something we
should be concerned about. We know for a fact that attacks like that
happen all the time (Pegasus, industry espionage, …), hence we should
make them hard.

Are we Safe?

So, does the scheme so far implemented by generic Linux distributions
protect us against the latter two scenarios? Unfortunately not at
all. Because distributions set up disk encryption the way they do, and
only bind it to a user password, an attacker can easily duplicate the
disk, and then attempt to brute force your password. What’s worse:
since code authentication ends at the kernel — and the initrd is not
authenticated anymore —, backdooring is trivially easy: an attacker
can change the initrd any way they want, without having to fight any
kind of protections. And given that FDE unlocking is implemented in
the initrd, and it's the initrd that asks for the encryption password,
things are just too easy: an attacker could trivially insert
some code that picks up the FDE password as you type it in and sends it
wherever they want. And not just that: since once they are in they are
in, they can do anything they like for the rest of the system’s
lifecycle, with full privileges — including installing backdoors for
versions of the OS or kernel that are installed on the device in the
future, so that their backdoor remains open for as long as they like.

That is sad of course. It's particularly sad given that the other
popular OSes all address this much better. ChromeOS, Android, Windows
and MacOS all have way better built-in protections against attacks
like this. And it’s why one can certainly claim that your data is
probably better protected right now if you store it on those OSes than
it is on generic Linux distributions.

(Yeah, I know that there are some niche distros which do this better,
and some hackers hack their own. But I care about general purpose
distros here, i.e. the big ones, that most people base their work on.)

Note that there are more problems with the current setup. For example,
it’s really weird that during boot the user is queried for an FDE
password which actually protects their data, and then once the system
is up they are queried again – now asking for a username, and another
password. And the weird thing is that this second authentication that
appears to be user-focused doesn’t really protect the user’s data
anymore — at that moment the data is already unlocked and
accessible. The username/password query is supposed to be useful in
multi-user scenarios of course, but how does that make any sense,
given that these multiple users would all have to know a disk
encryption password that unlocks the whole thing during the FDE step,
and thus they have access to every user’s data anyway if they make an
offline copy of the harddisk?

Can we do better?

Of course we can, and that is what this story is actually supposed to
be about.

Let’s first figure out what the minimal issues we should fix are (at
least in my humble opinion):

  1. The initrd must be authenticated before being booted into. (And
    measured unconditionally.)

  2. The OS binary resources (i.e. /usr/) must be authenticated before
    being booted into. (But don’t need to be encrypted, since everyone
    has the same anyway, there’s nothing to hide here.)

  3. The OS configuration and state (i.e. /etc/ and /var/) must be
    encrypted, and authenticated before they are used. The encryption
    key should be bound to the TPM device; i.e. system data should be
    locked to a security concept belonging to the system, not the user.

  4. The user’s home directory (i.e. /home/lennart/ and similar) must
    be encrypted and authenticated. The unlocking key should be bound
    to a user password or user security token (FIDO2 or PKCS#11 token);
    i.e. user data should be locked to a security concept belonging to
    the user, not the system.

Or to summarize this differently:

  1. Every single component of the boot
    process and OS needs to be authenticated, i.e. all of shim (done),
    boot loader (done), kernel (done), initrd (missing so far), OS binary
    resources (missing so far), OS configuration and state (missing so
    far), the user’s home (missing so far).

  2. Encryption is necessary for the OS configuration and state (bound
    to TPM), and for the user’s home directory (bound to a user
    password or user security token).

In Detail

Let’s see how we can achieve the above in more detail.

How to Authenticate the initrd

At the moment initrds are generated on the installed host via scripts
(dracut and similar) that try to figure out a minimal set of binaries
and configuration data to build an initrd that contains just enough to
be able to find and set up the root file system. What is included in
the initrd hence depends highly on the individual installation and its
configuration. Pretty likely no two initrds generated that way will be
fully identical due to this. This model clearly has benefits: the
initrds generated this way are very small and minimal, and support
exactly what is necessary for the system to boot, and not less or
more. It comes with serious drawbacks too though: the generation
process is fragile and sometimes more akin to black magic than
following clear rules: the generator script natively has to understand
a myriad of storage stacks to determine what needs to be included and
what not. It also means that authenticating the image is hard: given
that each individual host gets a different specialized initrd, it
means we cannot just sign the initrd with the vendor key like we sign
the kernel. If we want to keep this design we’d have to figure out
some other mechanism (e.g. a per-host signature key – that is
generated locally; or by authenticating it with a message
authentication code bound to the TPM). While these approaches are
certainly thinkable, I am not convinced they actually are a good idea
though: locally and dynamically generated per-host initrds are
something we probably should move away from.

If we move away from locally generated initrds, things become a lot
simpler. If the distribution vendor generates the initrds on their
build systems then it can be attached to the kernel image itself, and
thus be signed and measured along with the kernel image, without any
further work. This simplicity is simply lovely. Besides robustness and
reproducibility this gives us an easy route to authenticated initrds.

But of course, nothing is really that simple: working with
vendor-generated initrds means that we can’t adjust them anymore to
the specifics of the individual host: if we pre-build the initrds and
include them in the kernel image in immutable fashion then it becomes
harder to support complex, more exotic storage or to parameterize it
with local network server information, credentials, passwords, and so
on. Now, for my simple laptop use-case these things don’t matter,
there’s no need to extend/parameterize things, laptops and their
setups are not that wildly different. But what to do about the cases
where we want both: extensibility to cover for less common storage
subsystems (iscsi, LVM, multipath, drivers for exotic hardware…) and
parameterization?

Here’s a proposal how to achieve that: let’s build a basic initrd into
the kernel as suggested, but then do two things to make this scheme
both extensible and parameterizable, without compromising security.

  1. Let's define a way for the basic initrd to be extended with
    additional files, which are stored in separate "extension
    images". The basic initrd should be able to discover these extension
    images, authenticate them and then activate them, thus extending
    the initrd with additional resources on-the-fly.

  2. Let's define a way to safely pass additional parameters to
    the kernel/initrd (and actually the rest of the OS, too) in an
    authenticated (and possibly encrypted) fashion. Parameters in this
    context can be anything specific to the local installation,
    i.e. server information, security credentials, certificates, SSH
    server keys, or even just the root password that shall be able to
    unlock the root account in the initrd …

In such a scheme we should be able to deliver everything we are
looking for:

  1. We’ll have a full trust chain for the code: the boot loader will
    authenticate and measure the kernel and basic initrd. The initrd
    extension images will then be authenticated by the basic initrd
    image.

  2. We’ll have authentication for all the parameters passed to the
    initrd.

This so far sounds very unspecific? Let’s make it more specific by
looking closer at the components I’d suggest to be used for this
logic:

  1. The systemd suite has for a few months now contained a subsystem
    implementing system extensions (v248). System extensions are
    ultimately just disk images (for example a squashfs file system in
    a GPT envelope) that can extend an underlying OS tree. Extending
    in this regard means they simply add additional files and
    directories into the OS tree, i.e. below /usr/. For a longer
    explanation see
    systemd-sysext(8). When
    a system extension is activated it is simply mounted and then
    merged into the main /usr/ tree via a read-only overlayfs
    mount. Now, what's particularly nice about them in the context we
    are talking about here is that the extension images may carry
    dm-verity authentication data, and PKCS#7 signatures (once this
    is merged, that
    is, i.e. v250
    ).

  2. The systemd suite also contains a concept called service
    “credentials”. These are small pieces of information passed to
    services in a secure way. One key feature of these credentials is
    that they can be encrypted and authenticated in a very simple way
    with a key bound to the TPM (v250). See
    LoadCredentialEncrypted=
    and
    systemd-creds(1)
    for details. They are great for safely storing SSL private keys and
    similar on your system, but they also come in handy for parameterizing
    initrds: an encrypted credential is just a file that can only be
    decoded if the right TPM is around with the right PCR values set.

  3. The systemd suite contains a component called
    systemd-stub(7). It’s
    an EFI stub, i.e. a small piece of code that is attached to a
    kernel image, and turns the kernel image into a regular EFI binary
    that can be directly executed by the firmware (or a boot
    loader). This stub has a number of nice features (for example, it
    can show a boot splash before invoking the Linux kernel itself and
    such). Once this work is
    merged (v250)
    the stub
    will support one more feature: it will automatically search for
    system extension image files and credential files next to the
    kernel image file, measure them and pass them on to the main initrd
    of the host.

Putting this together we have a nice way to provide fully authenticated
kernel images, initrd images and initrd extension images; as well as
encrypted and authenticated parameters via the credentials logic.

How would a distribution actually make use of this? A distribution
vendor would pre-build the basic initrd, and glue it into the kernel
image, and sign that as a whole. Then, for each supported extension of
the basic initrd (e.g. one for iscsi support, one for LVM, one for
multipath, …), the vendor would use a tool such as
mkosi to build an extension image,
i.e. a GPT disk image containing the files in squashfs format, a
Verity partition that authenticates it, plus a PKCS#7 signature
partition that validates the root hash for the dm-verity partition,
and that can be checked against a key provided by the boot loader or
main initrd. Then, any parameters for the initrd will be encrypted
using systemd-creds encrypt
-T
. The
resulting encrypted credentials and the initrd extension images are
then simply placed next to the kernel image in the ESP (or boot
partition). Done.
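
As a rough sketch of that last step (the credential name is the one
systemd-firstboot consumes, used here as an example; the output file
name and its exact location next to the kernel image are
illustrative):

$ echo -n 'C.UTF-8' | systemd-creds encrypt --name=firstboot.locale -T - firstboot.locale.cred

The resulting .cred file would then simply be dropped next to the
kernel image in the ESP, where the stub picks it up at boot.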

This checks all boxes: everything is authenticated and measured, the
credentials also encrypted. Things remain extensible and modular, can
be pre-built by the vendor, and installation is as simple as dropping
in one file for each extension and/or credential.

How to Authenticate the Binary OS Resources

Let’s now have a look how to authenticate the Binary OS resources,
i.e. the stuff you find in /usr/, i.e. the stuff traditionally
shipped to the user’s system via RPMs or DEBs.

I think there are three relevant ways how to authenticate this:

  1. Make /usr/ a dm-verity volume. dm-verity is a concept
    implemented in the Linux kernel that provides authenticity to
    read-only block devices: every read access is cryptographically
    verified against a top-level hash value. This top-level
    hash is typically a 256bit value that you can either encode in the
    kernel image you are using, or cryptographically sign (which is
    particularly nice once this is
    merged
    ). I think
    this is actually the best approach since it makes the /usr/ tree
    entirely immutable in a very simple way. However, this also means
    that the whole of /usr/ needs to be updated at once, i.e. the
    traditional rpm/apt based update logic cannot work in this
    mode.

  2. Make /usr/ a dm-integrity volume. dm-integrity is a concept
    provided by the Linux kernel that offers integrity guarantees to
    writable block devices, i.e. in some ways it can be considered to be
    a bit like dm-verity while permitting write access. It can be
    used in three ways, one of which I think is particularly relevant
    here. The first way is with a simple hash function in “stand-alone”
    mode: this is not too interesting here, it just provides greater
    data safety for file systems that don’t hash check their files’ data
    on their own. The second way is in combination with dm-crypt,
    i.e. with disk encryption. In this case it adds authenticity to
    confidentiality: only if you know the right secret you can read and
    make changes to the data, and any attempt to make changes without
    knowing this secret key will be detected as IO error on next read
    by those in possession of the secret (more about this below). The
    third way is the one I think is most interesting here: in
    “stand-alone” mode, but with a keyed hash function
    (e.g. HMAC). What’s this good for? This provides authenticity
    without encryption: if you make changes to the disk without knowing
    the secret this will be noticed on the next read attempt of the
    data and result in IO errors. This mode provides what we want
    (authenticity) and doesn’t do what we don’t need (encryption). Of
    course, the secret key for the HMAC must be provided somehow, I
    think ideally by the TPM.

  3. Make /usr/ a dm-crypt (LUKS) + dm-integrity volume. This
    provides both authenticity and encryption. The latter isn’t
    typically needed for /usr/ given that it generally contains no
    secret data: anyone can download the binaries off the Internet
    anyway, and the sources too. By encrypting this you’ll waste CPU
    cycles, but beyond that it doesn’t hurt much. (Admittedly, some
    people might want to hide the precise set of packages they have
    installed, since it of course does reveal a bit of information
    about you: i.e. what you are working on, maybe what your job is –
    think: if you are a hacker you have hacking tools installed – and
    similar). Going this way might simplify things in some cases, as it
    means you don’t have to distinguish “OS binary resources” (i.e
    /usr/) and “OS configuration and state” (i.e. /etc/ + /var/,
    see below), and just make it the same volume. Here too, the secret
    key must be provided somehow, I think ideally by the TPM.

All three approaches are valid. The first approach has my primary
sympathies, but for distributions not willing to abandon client-side
updates via RPM/dpkg this is not an option, in which case I would
propose one of the other two approaches instead.
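
To make the first approach a bit more concrete, here is a rough
manual sketch of protecting a /usr/ partition with dm-verity using
cryptsetup's veritysetup tool (the device paths are examples, and the
root hash is the value printed by the format step):

$ veritysetup format /dev/sda2 /dev/sda3   # writes hash data, prints the root hash
$ veritysetup open /dev/sda2 usr /dev/sda3 <root-hash>
$ mount -o ro /dev/mapper/usr /usr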

The LUKS encryption key (and in case of dm-integrity standalone mode
the key for the keyed hash function) should be bound to the TPM. Why
the TPM for this? You could also use a user password, a FIDO2 or
PKCS#11 security token — but I think the TPM is the right choice: it
reduces the need for repeated authentication, i.e. having to first
provide the disk encryption password and then having to log in with
another password. It should be possible that
the system boots up unattended and then only one authentication prompt
is needed to unlock the user’s data properly. The TPM provides a way
to do this in a reasonably safe and fully unattended way. Also, when
we stop considering just the laptop use-case for a moment: on servers
interactive disk encryption prompts don’t make much sense — the fact
that TPMs can provide secrets without this requiring user interaction
and thus the ability to work in entirely unattended environments is
quite desirable. Note that
crypttab(5)
as implemented by systemd (v248) provides native support for
authentication via password, via TPM2, via PKCS#11 or via FIDO2, so
the choice is ultimately all yours.
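
As a rough sketch of what TPM2 binding looks like with the systemd
tooling just mentioned (the device path is an example; binding to PCR
7 is discussed further below):

$ systemd-cryptenroll --tpm2-device=auto --tpm2-pcrs=7 /dev/sda4
# matching /etc/crypttab entry:
root  /dev/sda4  none  tpm2-device=auto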

How to Encrypt/Authenticate OS Configuration and State

Let’s now look at the OS configuration and state, i.e. the stuff in
/etc/ and /var/. It probably makes sense to not consider these two
hierarchies independently but instead just consider this to be the
root file system. If the OS binary resources are in a separate file
system it is then mounted onto the /usr/ sub-directory of the root
file system.

The OS configuration and state (or: root file system) should be both
encrypted and authenticated: it might contain secret keys, user
passwords, privileged logs and similar. This data matters and contains
plenty of data that should remain confidential.

The encryption of choice here is dm-crypt (LUKS) + dm-integrity
similar as discussed above, again with the key bound to the TPM.

If the OS binary resources are protected the same way it is safe to
merge these two volumes and have a single partition for both (see
above).

How to Encrypt/Authenticate the User’s Home Directory

The data in the user’s home directory should be encrypted, and bound
to the user’s preferred token of authentication (i.e. a password or
FIDO2/PKCS#11 security token). As mentioned, in the traditional mode
of operation the user’s home directory is not individually encrypted,
but only encrypted because FDE is in use. The encryption key for that
is a system wide key though, not a per-user key. And I think that's
a problem, as mentioned (and probably not even generally understood by
our users). We should correct that and ensure that the user’s password
is what unlocks the user’s data.

In the systemd suite we provide a service
systemd-homed(8)
(v245) that implements this in a safe way: each user gets its own LUKS
volume stored in a loopback file in /home/, and this is enough to
synthesize a user account. The encryption password for this volume is
the user’s account password, thus it’s really the password provided at
login time that unlocks the user’s data. systemd-homed also supports
other mechanisms of authentication, in particular PKCS#11/FIDO2
security tokens. It also provides support for other storage back-ends
(such as fscrypt), but I’d always suggest to use the LUKS back-end
since it’s the only one providing the comprehensive confidentiality
guarantees one wants for a UNIX-style home directory.
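
As a minimal sketch, creating such a home directory with
systemd-homed's client tool could look like this (the user name and
options are chosen for illustration; FIDO2 support requires a
reasonably recent systemd):

$ homectl create lennart --storage=luks --fido2-device=auto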

Note that there’s one special caveat here: if the user’s home
directory (e.g. /home/lennart/) is encrypted and authenticated, what
about the file system this data is stored on, i.e. /home/ itself? If
that dir is part of the root file system this would result in
double encryption: first the data is encrypted with the TPM root file
system key, and then again with the per-user key. Such double
encryption is a waste of resources, and unnecessary. I’d thus suggest
to make /home/ its own dm-integrity volume with a HMAC, keyed by
the TPM. This means the data stored directly in /home/ will be
authenticated but not encrypted. That’s good not only for performance,
but also has practical benefits: it allows extracting the encrypted
volume of the various users in case the TPM key is lost, as a way to
recover from dead laptops or similar.
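
A rough sketch of what such a keyed dm-integrity setup for /home/
could look like when done manually with cryptsetup's integritysetup
tool (the device and key file paths are placeholders; in the scheme
proposed here the key would be supplied by the TPM rather than a
plain file):

$ integritysetup format /dev/sda5 --integrity hmac-sha256 --integrity-key-file /path/to/hmac.key --integrity-key-size 32
$ integritysetup open /dev/sda5 home --integrity hmac-sha256 --integrity-key-file /path/to/hmac.key --integrity-key-size 32
$ mkfs.ext4 /dev/mapper/home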

Why authenticate /home/, if it only contains per-user home
directories that are authenticated on their own anyway? That’s a
valid question: it’s because the kernel file system maintainers made
clear that Linux file system code is not considered safe against rogue
disk images, and is not tested for that; this means before you mount
anything you need to establish trust in some way because otherwise
there’s a risk that the act of mounting might exploit your kernel.

Summary of Resources and their Protections

So, let’s now put this all together. Here’s a table showing the
various resources we deal with, and how I think they should be
protected (in my idealized world).

Resource | Needs Authentication | Needs Encryption | Suggested Technology | Validation/Encryption Keys/Certificates acquired via | Stored where
Shim | yes | no | SecureBoot signature verification | firmware certificate database | ESP
Boot loader | yes | no | ditto | firmware certificate database/shim | ESP/boot partition
Kernel | yes | no | ditto | ditto | ditto
initrd | yes | no | ditto | ditto | ditto
initrd parameters | yes | yes | systemd TPM encrypted credentials | TPM | ditto
initrd extensions | yes | no | systemd-sysext with Verity+PKCS#7 signatures | firmware/initrd certificate database | ditto
OS binary resources | yes | no | dm-verity | root hash linked into kernel image, or firmware/initrd certificate database | top-level partition
OS configuration and state | yes | yes | dm-crypt (LUKS) + dm-integrity | TPM | top-level partition
/home/ itself | yes | no | dm-integrity with HMAC | TPM | top-level partition
User home directories | yes | yes | dm-crypt (LUKS) + dm-integrity in loopback files | User password/FIDO2/PKCS#11 security token | loopback file inside /home partition

This should provide all the desired guarantees: everything is
authenticated, and the individualized per-host or per-user data
is also encrypted. No double encryption takes place. The encryption
keys/verification certificates are stored/bound to the most appropriate
infrastructure.

Does this address the three attack scenarios mentioned earlier? I
think so, yes. The basic attack scenario I described is addressed by
the fact that /var/, /etc/ and /home/*/ are encrypted. Brute
forcing the former two is harder than in the status quo ante model,
since a high entropy key is used instead of one derived from a user
provided password. Moreover, the “anti-hammering” logic of the TPM
will make brute forcing prohibitively slow. The home directories are
protected by the user’s password or ideally a personal FIDO2/PKCS#11
security token in this model. Of course, a password isn't better
security-wise than the status quo ante. But given the FIDO2/PKCS#11
support built into systemd-homed it should be easier to lock down
the home directories securely.

Binding encryption of /var/ and /etc/ to the TPM also addresses
the first of the two more advanced attack scenarios: a copy of the
harddisk is useless without the physical TPM chip, since the seed key
is sealed into that. (And even if the attacker had the chance to watch
you type in your password, it won't help unless they possess access
to the TPM chip.) For the home directory this attack is not addressed
as long as a plain password is used. However, since binding home
directories to FIDO2/PKCS#11 tokens is built into systemd-homed
things should be safe here too — provided the user actually possesses
and uses such a device.

The backdoor attack scenario is addressed by the fact that every
resource in play now is authenticated: it’s hard to backdoor the OS if
there’s no component that isn’t verified by signature keys or TPM
secrets the attacker hopefully doesn’t know.

For general purpose distributions that focus on updating the OS per
RPM/dpkg the idealized model above won’t work out, since (as
mentioned) this implies an immutable /usr/, and thus requires
updating /usr/ via an atomic update operation. For such distros a
setup like the following is probably more realistic, but see above.

Resource | Needs Authentication | Needs Encryption | Suggested Technology | Validation/Encryption Keys/Certificates acquired via | Stored where
Shim | yes | no | SecureBoot signature verification | firmware certificate database | ESP
Boot loader | yes | no | ditto | firmware certificate database/shim | ESP/boot partition
Kernel | yes | no | ditto | ditto | ditto
initrd | yes | no | ditto | ditto | ditto
initrd parameters | yes | yes | systemd TPM encrypted credentials | TPM | ditto
initrd extensions | yes | no | systemd-sysext with Verity+PKCS#7 signatures | firmware/initrd certificate database | ditto
OS binary resources, configuration and state | yes | yes | dm-crypt (LUKS) + dm-integrity | TPM | top-level partition
/home/ itself | yes | no | dm-integrity with HMAC | TPM | top-level partition
User home directories | yes | yes | dm-crypt (LUKS) + dm-integrity in loopback files | User password/FIDO2/PKCS#11 security token | loopback file inside /home partition

This means there’s only one root file system that contains all of
/etc/, /var/ and /usr/.

Recovery Keys

When binding encryption to TPMs one problem that arises is what
strategy to adopt if the TPM is lost, due to hardware failure: if I
need the TPM to unlock my encrypted volume, what do I do if I need the
data but lost the TPM?

The answer here is supporting recovery keys (this is similar to how
other OSes approach this). Recovery keys are pretty much the same
concept as passwords. The main difference being that they are computer
generated rather than user-chosen. Because of that they typically have
much higher entropy (which makes them more annoying to type in, i.e.
you want to use them only when you must, not day-to-day). By having
higher entropy they are useful in combination with TPM, FIDO2 or
PKCS#11 based unlocking: unlike a combination with passwords they do
not compromise the higher strength of protection that
TPM/FIDO2/PKCS#11 based unlocking is supposed to provide.

Current versions of
systemd-cryptenroll(1)
implement a recovery key concept in an attempt to address this
problem. You may enroll any combination of TPM chips, PKCS#11 tokens,
FIDO2 tokens, recovery keys and passwords on the same LUKS
volume. When enrolling a recovery key it is generated and shown on
screen both in text form and as QR code you can scan off screen if you
like. The idea is to write down/store this recovery key in a safe place so
that you can use it when you need it. Note that such recovery keys can
be entered wherever a LUKS password is requested, i.e. after
generation they behave pretty much the same as a regular password.
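
For example, enrolling such a recovery key is a single command (the
device path is a placeholder):

$ systemd-cryptenroll --recovery-key /dev/sda4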

TPM PCR Brittleness

Locking devices to TPMs and enforcing a PCR policy with this
(i.e. configuring the TPM key to be unlockable only if certain PCRs
match certain values, and thus requiring the OS to be in a certain
state) brings a problem with it: TPM PCR brittleness. If the key you
want to unlock with the TPM requires the OS to be in a specific state
(i.e. that all OS components’ hashes match certain expectations or
similar) then doing OS updates might have the effect of making your
key inaccessible: the OS updates will cause the code to change, and
thus the hashes of the code, and thus certain PCRs. (Thankfully, you
enrolled a recovery key, as described above, so this doesn't mean you
lost your data, right?).

To address this I’d suggest three strategies:

  1. Most importantly: don’t actually use the TPM PCRs that contain code
    hashes. There are actually multiple PCRs
    defined
    ,
    each containing measurements of different aspects of the boot
    process. My recommendation is to bind keys to PCR 7 only, a PCR
    that contains measurements of the UEFI SecureBoot certificate
    databases. Thus, the keys will remain accessible as long as these
    databases remain the same, and updates to code will not affect it
    (updates to the certificate databases will, and they do happen too,
    though hopefully much less frequently than code updates). Does this
    reduce security? Not much, no, because the code that’s run is after
    all not just measured but also validated via code signatures, and
    those signatures are validated with the aforementioned certificate
    databases. Thus binding an encrypted TPM key to PCR 7 should
    enforce a similar level of trust in the boot/OS code as binding it
    to a PCR with hashes of specific versions of that code. i.e. using
    PCR 7 means you say “every code signed by these vendors is allowed
    to unlock my key” while using a PCR that contains code hashes means
    “only this exact version of my code may access my key”.

  2. Use LUKS key management to enroll multiple versions of the TPM keys
    in relevant volumes, to support multiple versions of the OS code
    (or multiple versions of the certificate database, as discussed
    above). Specifically: whenever an update is done that might result in
    changing the relevant PCRs, pre-calculate the new PCRs, and enroll
    them in an additional LUKS slot on the relevant volumes. This means
    that the unlocking keys tied to the TPM remain accessible in both
    states of the system. Eventually, once rebooted after the update,
    remove the old slots.

  3. If these two strategies didn’t work out (maybe because the
    OS/firmware was updated outside of OS control, or the update
    mechanism was aborted at the wrong time) and the TPM PCRs changed
    unexpectedly, and the user now needs to use their recovery key to
    get access to the OS back, let’s handle this gracefully and
    automatically reenroll the current TPM PCRs at boot, after the
    recovery key checked out, so that for future boots everything is in
    order again.

Other approaches can work too: for example, some OSes simply remove
TPM PCR policy protection of disk encryption keys altogether
immediately before OS or firmware updates, and then reenable it right
after. Of course, this opens a time window where the key bound to the
TPM is much less protected than people might assume. I’d try to avoid
such a scheme if possible.

Anything Else?

So, given that we are talking about idealized systems: I personally
actually think the ideal OS would be much simpler, and thus more
secure than this:

I’d try to ditch the Shim, and instead focus on enrolling the
distribution vendor keys directly in the UEFI firmware certificate
list. This is actually supported by all firmwares too. This has
various benefits: it’s no longer necessary to bind everything to
Microsoft’s root key, you can just enroll your own stuff and thus make
sure only what you want to trust is trusted and nothing else. To make
an approach like this easier, we have been working on doing automatic
enrollment of these keys from the systemd-boot boot loader, see
this work in progress for
details
. This way the
firmware will authenticate the boot loader/kernel/initrd without any
further component for this in place.

I’d also not bother with a separate boot partition, and just use the
ESP for everything. The ESP is required anyway by the firmware, and is
good enough for storing the few files we need.

FAQ

Can I implement all of this in my distribution today?

Probably not. While the big issues have mostly been addressed there’s
a lot of integration work still missing. As you might have seen I
linked some PRs that haven't even been merged into our tree yet, let
alone been released or entered the distributions.

Will this show up in Fedora/Debian/Ubuntu soon?

I don’t know. I am making a proposal how these things might work, and
am working on getting various building blocks for this into
shape. What the distributions do is up to them. But even if they don’t
follow the recommendations I make 100%, or don’t want to use the
building blocks I propose I think it’s important they start thinking
about this, and yes, I think they should be thinking about defaulting
to setups like this.

Work for measuring/signing initrds on Fedora has been started,
here’s a slide deck with some information about
it
.

But isn’t a TPM evil?

Some corners of the community tried (unfortunately successfully to
some degree) to paint TPMs/Trusted Computing/SecureBoot as generally
evil technologies that stop us from using our systems the way we
want. That idea is rubbish though, I think. We should focus on what it
can deliver for us (and that’s a lot I think, see above), and
appreciate the fact we can actually use it to kick out perceived evil
empires from our devices instead of being subjected to them. Yes, the
way SecureBoot/TPMs are defined puts you in the driver's seat if you
want — and you may enroll your own certificates to keep out everything
you don’t like.

What if my system doesn’t have a TPM?

TPMs are becoming quite ubiquitous, in particular as the upcoming
Windows versions will require them. In general I think we should focus
on modern, fully equipped systems when designing all this, and then
find fall-backs for more limited systems. Frankly it feels as if so
far the design approach for all this was the other way round: try to
make the new stuff work like the old rather than the old like the new
(I mean, to me it appears this thinking is the main raison d’être for
the Grub boot loader).

More specifically, on the systems where we have no TPM we ultimately
cannot provide the same security guarantees as for those that
do. So depending on the resource to protect we should fall back to
different TPM-less mechanisms. For example, if we have no TPM then the
root file system should probably be encrypted with a user provided
password, typed in at boot as before. And for the encrypted boot
credentials we probably should simply not encrypt them, and place them
in the ESP unencrypted.

Effectively this means: without TPM you’ll still get protection regarding the
basic attack scenario, as before, but not the other two.

What if my system doesn’t have UEFI?

Many of the mechanisms explained above taken individually do not
require UEFI. But of course the chain of trust suggested above requires
something like UEFI SecureBoot. If your system lacks UEFI it’s
probably best to find work-alikes to the technologies suggested above,
but I doubt I’ll be able to help you there.

rpm/dpkg already cryptographically validates all packages at installation time (gpg), why would I need more than that?

This type of package validation happens once: at the moment of
installation (or update) of the package, but not anymore when the data
installed is actually used. Thus when an attacker manages to modify
the package data after installation and before use they can make any
change they like without this ever being noticed. Such package download
validation does address certain attack scenarios
(i.e. man-in-the-middle attacks on network downloads), but it doesn’t
protect you from attackers with physical access, as described in the
attack scenarios above.

Systems such as ostree aren’t better than rpm/dpkg regarding this
BTW, their data is not validated on use either, but only during
download or when processing tree checkouts.

The key here really is that the scheme explained above provides offline
protection for the data “at rest” — even someone with physical access
to your device cannot easily make changes that aren’t noticed on next
use. rpm/dpkg/ostree provide online protection only: as long as the
system remains up, and all OS changes are done through the intended
program code-paths, and no one has physical access everything should
be good. In today’s world I am sure this is not good enough though. As
mentioned most modern OSes provide offline protection for the data at
rest in one way or another. Generic Linux distributions are terribly
behind on this.

This is all so desktop/laptop focused, what about servers?

I am pretty sure servers should provide similar security guarantees as
outlined above. In a way servers are a much simpler case: there are no
users and no interactivity. Thus the discussion of /home/ and what
it contains and of user passwords doesn’t matter. However, the
authenticated initrd and the unattended TPM-based encryption I think
are very important for servers too, in a trusted data center
environment. It provides security guarantees so far not given by Linux
server OSes.

I’d like to help with this, or discuss/comment on this

Submit patches or reviews through
GitHub. General discussion about
this is best done on the systemd mailing
list
.

The Wondrous World of Discoverable GPT Disk Images

Post Syndicated from original http://0pointer.net/blog/the-wondrous-world-of-discoverable-gpt-disk-images.html

TL;DR: Tag your GPT partitions with the right, descriptive partition
types, and the world will become a better place.

A number of years ago we started the Discoverable Partitions
Specification
which
defines GPT
partition type UUIDs and partition flags for the various partitions
Linux systems typically deal with. Before the specification all Linux
partitions usually just used the same type, basically saying “Hey, I
am a Linux partition” and not much else. With this specification the
GPT partition type, flags and label system becomes a lot more
expressive, as it can tell you:

  1. What kind of data a partition contains (i.e. is this swap data, a file system or Verity data?)
  2. What the purpose/mount point of a partition is (i.e. is this a /home/ partition or a root file system?)
  3. What CPU architecture a partition is intended for (i.e. is this a root partition for x86-64 or for aarch64?)
  4. Shall this partition be mounted automatically? (i.e. without being specifically configured via /etc/fstab)
  5. And if so, shall it be mounted read-only?
  6. And if so, shall the file system be grown to its enclosing partition size, if smaller?
  7. Which partition contains the newer version of the same data (i.e. multiple root file systems, with different versions)

By embedding all of this information inside the GPT partition table
disk images become self-descriptive: without requiring any other
source of information (such as /etc/fstab) if you look at a
compliant GPT disk image it is clear how an image is put together and
how it should be used and mounted. This self-descriptiveness in
particular breaks one philosophical weirdness of traditional Linux
installations: the original source of information which file system
the root file system is, typically is embedded in the root file system
itself, in /etc/fstab. Thus, in a way, in order to know what the
root file system is you need to know what the root file system is. 🤯
🤯 🤯

(Of course, the way this recursion is traditionally broken up is by
then copying the root file system information from /etc/fstab into
the boot loader configuration, resulting in a situation where the
primary source of information for this — i.e. /etc/fstab — is
actually mostly irrelevant, and the secondary source — i.e. the copy
in the boot loader — becomes the configuration that actually matters.)

Today, the GPT partition type UUIDs defined by the specification have
been adopted quite widely, by distributions and their installers, as
well as a variety of partitioning tools and other tools.

In this article I want to highlight how the various tools the
systemd project provides make use of the
concepts the specification introduces.

But before we start with that, let’s underline why tagging partitions
with these descriptive partition type UUIDs (and the associated
partition flags) is a good thing, besides the philosophical points
made above.

  1. Simplicity: in particular OS installers become simpler — adjusting
    /etc/fstab as part of the installation is not necessary anymore,
    as the partitioning step already put all information into place for
    assembling the system properly at boot. i.e. installing doesn’t
    mean that you always have to get fdisk and /etc/fstab into
    place, the former suffices entirely.

  2. Robustness: since partition tables mostly remain static after
    installation the chance of corruption is much lower than if the
    data is stored in file systems (e.g. in /etc/fstab). Moreover by
    associating the metadata directly with the objects it describes the
    chance of things getting out of sync is reduced. (i.e. if you lose
    /etc/fstab, or forget to rerun your initrd builder you still know
    what a partition is supposed to be just by looking at it.)

  3. Programmability: if partitions are self-descriptive it’s much
    easier to automatically process them with various tools. In fact,
    this blog story is mostly about that: various systemd tools can
    naturally process disk images prepared like this.

  4. Alternative entry points: on traditional disk images, the boot
    loader needs to be told which kernel command line option root= to
    use, which then provides access to the root file system, where
    /etc/fstab is then found which describes the rest of the file
    systems. Where precisely root= is configured for the boot loader
    highly depends on the boot loader and distribution used, and is
    typically encoded in a Turing complete programming language
    (Grub…). This makes it very hard to automatically determine the
    right root file system to use, to implement alternative entry points
    to the system. By alternative entry points I mean other ways to boot
    the disk image, specifically for running it as a systemd-nspawn
    container — but this extends to other mechanisms where the boot
    loader may be bypassed to boot up the system, for example qemu
    when configured without a boot loader.

  5. User friendliness: it’s simply a lot nicer for the user looking at
    a partition table if the partition table explains what is what,
    instead of just saying “Hey, this is a Linux partition!” and
    nothing else.

Uses for the concept

Now that we cleared up the Why?, let's have a closer look at how this
is currently used and exposed in systemd's various components.

Use #1: Running a disk image in a container

If a disk image follows the Discoverable Partition Specification then
systemd-nspawn
has all it needs to just boot it up. Specifically, if you have a GPT
disk image in a file foobar.raw and you want to boot it up in a
container, just run systemd-nspawn -i foobar.raw -b, and that’s it
(you can specify a block device like /dev/sdb too if you like). It
becomes easy and natural to prepare disk images that can be booted
either on a physical machine, inside a virtual machine manager or
inside such a container manager: the necessary meta-information is
included in the image, easily accessible before actually looking into
its file systems.

Use #2: Booting an OS image on bare-metal without /etc/fstab or kernel command line root=

If a disk image follows the specification in many cases you can remove
/etc/fstab (or never even install it) — as the basic information
needed is already included in the partition table. The
systemd-gpt-auto-generator
logic implements automatic discovery of the root file system as well
as all auxiliary file systems. (Note that the former requires an
initrd that uses systemd; some more conservative distributions do not
support that yet, unfortunately). Effectively this means you can boot
up a kernel/initrd with an entirely empty kernel command line, and the
initrd will automatically find the root file system (by looking for a
suitably marked partition on the same drive the EFI System Partition
was found on).

(Note that if /etc/fstab or root= exist and contain relevant
information, they always take precedence over the automatic logic. This
is in particular useful for tweaking things, e.g. by specifying
additional mount options and such.)

Use #3: Mounting a complex disk image for introspection or manipulation

The
systemd-dissect
tool may be used to introspect and manipulate OS disk images that
implement the specification. If you pass the path to a disk image (or
block device) it will extract various bits of useful information from
the image (e.g. what OS is this? what partitions to mount?) and display it.

With the --mount switch a disk image (or block device) can be
mounted to some location. This is useful for looking at what is inside
it, or for changing its contents. This will dissect the image and then
automatically mount all contained file systems matching their GPT
partition description to the right places, so that you subsequently
could chroot into it. (But why chroot if you can just use systemd-nspawn? 😎)
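
For example, assuming a compliant disk image foobar.raw, inspecting it
and mounting it for a look inside should boil down to something like
this (/mnt/image is just an arbitrary mount point):

# systemd-dissect foobar.raw
# systemd-dissect --mount foobar.raw /mnt/image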

Use #4: Copying files in and out of a disk image

The
systemd-dissect
tool also has two switches --copy-from and --copy-to which allow
copying files out of or into a compliant disk image, taking all
included file systems and the resulting mount hierarchy into account.
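
For instance, copying the os-release file out of an image, or copying a
file into it, should look something like this (the destination paths
here are just examples):

# systemd-dissect --copy-from foobar.raw /etc/os-release /tmp/os-release
# systemd-dissect --copy-to foobar.raw /tmp/my.conf /etc/my.conf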

Use #5: Running services directly off a disk image

The
RootImage=
setting in service unit files accepts paths to compliant disk images
(or block device nodes), and can mount them automatically, running
service binaries directly off them (in chroot() style). In fact,
this is the base for the Portable
Service
concept of systemd.
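
A minimal sketch of what this might look like in a unit file (the
service binary path is just a placeholder):

[Unit]
Description=Example service running directly off a disk image

[Service]
RootImage=/path/to/foobar.raw
ExecStart=/usr/bin/mydaemon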

Use #6: Provisioning disk images

systemd provides various tools that can run operations provisioning
disk images in an “offline” mode. Specifically:

systemd-tmpfiles

With the --image= switch
systemd-tmpfiles
can directly operate on a disk image, and for example create all
directories and other inodes defined in its declarative configuration
files included in the image. This can be useful for example to set up
the /var/ or /etc/ tree according to such configuration before
first boot.

systemd-sysusers

Similarly, the --image= switch of
systemd-sysusers
tells the tool to read the declarative system user specifications
included in the image and synthesize system users from them, writing
them to the /etc/passwd (and related) files in the image. This is
useful for provisioning these users before the first boot, for example
to ensure UID/GID numbers are pre-allocated, and such allocations are
not delayed until first boot.

systemd-machine-id-setup

The --image= switch of
systemd-machine-id-setup
may be used to provision a fresh machine ID into
/etc/machine-id
of a disk image, before first boot.

systemd-firstboot

The --image= switch of
systemd-firstboot
may be used to set various basic system settings (such as root
password, locale information, hostname, …) on the specified disk
image, before booting it up.
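
Put together, an offline provisioning run over an image could thus look
something like this (the locale and hostname are of course just
examples):

# systemd-tmpfiles --image=foobar.raw --create
# systemd-sysusers --image=foobar.raw
# systemd-machine-id-setup --image=foobar.raw
# systemd-firstboot --image=foobar.raw --locale=en_US.UTF-8 --hostname=appliance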

Use #7: Extracting log information

The
journalctl
switch --image= may be used to show the journal log data included in
a disk image (or, as usual, the specified block device). This is very
useful for analyzing failed systems offline, as it gives direct access
to the logs without any further, manual analysis.
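
For example, to list the boots recorded in an image and show everything
logged at error level or above, something like this should do:

# journalctl --image=foobar.raw --list-boots
# journalctl --image=foobar.raw -p err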

Use #8: Automatic repartitioning/growing of file systems

The
systemd-repart
tool may be used to repartition a disk or image in a declarative and
additive way. One primary use-case for it is to run during boot on
physical or VM systems to grow the root file system to the disk size,
or to add in, format, encrypt, populate additional partitions at boot.

With its --image= switch the tool may operate on compliant disk
images in an offline mode of operation: it will then read the partition
definitions that shall be grown or created off the image itself, and
then apply them to the image. This is particularly useful in
combination with --size=, which allows growing disk images to the
specified size.

Specifically, consider the following work-flow: you download a
minimized disk image foobar.raw that contains only the minimized
root file system (and maybe an ESP, if you want to boot it on
bare-metal, too). You then run systemd-repart --image=foobar.raw
--size=15G
to enlarge the image to 15G, based on the declarative
rules defined in the
repart.d/
drop-in files included in the image (this means it can grow the root
partition, and/or add in more partitions, for example for /srv,
maybe encrypted with a locally generated key). Then, you
proceed to boot it up with systemd-nspawn --image=foobar.raw -b, making
use of the full 15G.
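
The repart.d/ definitions referenced here are small INI-style drop-in
files. A minimal one that just grows the root partition to fill the
available space might look like this (a sketch; see the repart.d(5) man
page for the full syntax, and additional drop-ins could add and format
further partitions):

# 50-root.conf
[Partition]
Type=root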

Versioning + Multi-Arch

Disk images implementing this specification can carry OS executables in one of three ways:

  1. Only a root file system

  2. Only a /usr/ file system (in which case the root file system is automatically picked as tmpfs).

  3. Both a root and a /usr/ file system (in which case the two are
    combined, with the /usr/ file system mounted into the root file
    system, possibly in read-only fashion)

They may also contain OS executables for different architectures,
permitting “multi-arch” disk images that can safely boot up on
multiple CPU architectures. As the root and /usr/ partition type
UUIDs are specific to architectures this is easily done by including
one such partition for x86-64, and another for aarch64. If the
image is then used on an x86-64 system, the former partition is
automatically used; on aarch64, the latter.

Moreover, these OS executables may be contained in different versions,
to implement a simple versioning scheme: when tools such as
systemd-nspawn or systemd-gpt-auto-generator dissect a disk image,
and they find two or more root or /usr/ partitions of the same type
UUID, they will automatically pick the one whose GPT partition label
(a 36 character free-form string every GPT partition may have) is the
newest according to
strverscmp()
(OK, truth be told, we don’t use strverscmp() as-is, but a modified
version with some more modern syntax and semantics, but conceptually
identical).

This logic makes it possible to implement a very simple and natural A/B
update scheme: an updater can drop multiple versions of the OS into
separate root or /usr/ partitions, always updating the partition label
to the version included therein once the download is complete. All of the
tools described here will then honour this, and always automatically
pick the newest version of the OS.

Verity

When building modern OS appliances, security is highly
relevant. Specifically, offline security matters: an attacker with
physical access should have a difficult time modifying the OS in a way
that isn’t noticed. Think of a car or a cell network base
station: these appliances are usually parked/deployed in environments
attackers can get physical access to. It’s essential that in this case
the OS itself is sufficiently protected, so that an attacker cannot just
mount the OS file system image, make modifications (inserting a
backdoor, spying software or similar) and have the system otherwise
continue to run without this being immediately detected.

A great way to implement offline security is via Linux’ dm-verity
subsystem: it makes it possible to securely bind immutable disk IO to a
single, short trusted hash value. If an attacker manages to modify the
disk image offline, the modified image won’t match the trusted hash
anymore, and will not be trusted (depending on policy this then just
results in IO errors being generated, or in automatic reboot/power-off).

The Discoverable Partitions Specification declares how to include
Verity validation data in disk images, and how to relate them to the file
systems they protect, thus making it very easy to deploy and work with
such protected images. For example systemd-nspawn supports a
--root-hash= switch, which accepts the Verity root hash and then
will automatically assemble dm-verity with this, automatically
matching up the payload and verity partitions. (Alternatively, just
place a .roothash file next to the image file).
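
For example, booting a Verity-protected image in a container would then
look something like this (the hash value is a placeholder for the
actual root hash your image build produced):

# systemd-nspawn -i foobar.raw --root-hash=<roothash> -b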

Future

The above already is a powerful tool set for working with disk
images. However, there are some more areas I’d like to extend this
logic to:

bootctl

Similar to the other tools mentioned above,
bootctl
(which is a tool to interface with the boot loader, and install/update
systemd’s own EFI boot loader
sd-boot)
should learn a --image= switch, to make installation of the boot
loader on disk images easy and natural. It would automatically find
the ESP and other relevant partitions in the image, and copy the boot
loader binaries into them (or update them).

coredumpctl

Similar to the existing journalctl --image= logic the coredumpctl
tool should also gain an --image= switch for extracting coredumps
from compliant disk images. The combination of journalctl --image=
and coredumpctl --image= would make it exceptionally easy to work
with OS disk images of appliances and extracting logging and debugging
information from them after failures.

And that’s all for now. Please refer to the specification and the man
pages for further details. If your distribution’s installer does not
yet tag the GPT partitions it creates with the right GPT type UUIDs,
consider asking them to do so.

Thank you for your time.

File Descriptor Limits

Post Syndicated from original http://0pointer.net/blog/file-descriptor-limits.html

TL;DR: don’t use select() + bump the RLIMIT_NOFILE soft limit to
the hard limit in your modern programs.

The primary way to reference, allocate and pin runtime OS resources on
Linux today are file descriptors (“fds”). Originally they were used to
reference open files and directories and maybe a bit more, but today
they may be used to reference almost any kind of runtime resource in
Linux userspace, including open devices, memory
(memfd_create(2)),
timers
(timerfd_create(2))
and even processes (with the new
pidfd_open(2)
system call). In a way, the philosophically skewed UNIX concept of
“everything is a file” through the proliferation of fds actually
acquires a bit of sensible meaning: “everything has a file
descriptor” is certainly a much better motto to adopt.

Because of this proliferation of fds, non-trivial modern programs tend
to have to deal with substantially more fds at the same time than they
traditionally did. Today, you’ll often encounter real-life programs
that have a few thousand fds open at the same time.

As with most runtime resources on Linux, limits are enforced on file
descriptors: once you hit the resource limit configured via
RLIMIT_NOFILE
any attempt to allocate more is refused with the EMFILE error —
until you close a couple of those you already have open.

Because fds weren’t such a universal concept traditionally, the limit
of RLIMIT_NOFILE used to be quite low. Specifically, when the Linux
kernel first invokes userspace it still sets RLIMIT_NOFILE to a low
value of 1024 (soft) and 4096 (hard). (Quick explanation: the soft
limit is what matters and causes the EMFILE issues, the hard limit
is a secondary limit that processes may bump their soft limit to — if
they like — without requiring further privileges to do so. Bumping the
limit further would require privileges, however.) A limit of 1024 fds
made fds a scarce resource: APIs tried to be careful with using fds,
since you simply couldn’t have that many of them at the same
time. This resulted in some questionable coding decisions and
concepts at various places: often secondary descriptors that are very
similar to fds — but were not actually fds — were introduced
(e.g. inotify watch descriptors), simply to avoid the low limits
enforced on true fds. Or code tried to aggressively close fds
when not absolutely needing them (e.g. ftw()/nftw()), losing the
nice + stable “pinning” effect of open fds.

Worse though is that certain OS level APIs were designed having only
the low limits in mind. The worst offender being the BSD/POSIX
select(2)
system call: it only works with fds in the numeric range of 0…1023
(aka FD_SETSIZE-1). If you have an fd outside of this range, tough
luck: select() won’t work, and only if you are lucky you’ll detect
that and can handle it somehow.

Linux fds are exposed as simple integers, and for most calls it is
guaranteed that the lowest unused integer is allocated for new
fds. Thus, as long as the RLIMIT_NOFILE soft limit is set to 1024
everything remains compatible with select(): the resulting fds will
also be below 1024. Yay. If we’d bump the soft limit above this
threshold though and at some point in time an fd higher than the
threshold is allocated, this fd would not be compatible with
select() anymore.

Because of that, indiscriminately increasing the soft RLIMIT_NOFILE
resource limit today for every userspace process is problematic: as
long as there’s userspace code still using select() doing so will
risk triggering hard-to-handle, hard-to-debug errors all over the
place.

However, given the nowadays ubiquitous use of fds for all
kinds of resources (did you know, an eBPF program is an fd? and a
cgroup too? and attaching an eBPF program to cgroup is another fd? …),
we’d really like to raise the limit anyway. 🤔

So before we continue thinking about this problem, let’s make the
problem more complex (…uh, I mean… “more exciting”) first. Having just
one hard and one soft per-process limit on fds is boring. Let’s add
more limits on fds to the mix. Specifically on Linux there are two
system-wide sysctls: fs.nr_open and fs.file-max. (Don’t ask me why
one uses a dash and the other an underscore, or why there are two of
them…) On today’s kernels they kinda lost their relevance. They had
some originally, because fds weren’t accounted by any other
counter. But today, the kernel tracks fds mostly as small pieces of
memory allocated on userspace requests — because that’s ultimately
what they are —, and thus charges them to the memory accounting done
anyway.

So now, we have four limits (actually: five if you count the memory
accounting) on the same kind of resource, and all of them make a
resource artificially scarce that we don’t want to be scarce. So what
to do?

Back in systemd v240 already (i.e. 2019) we decided to do something
about it. Specifically:

  • Automatically at boot we’ll now bump the two sysctls to their
    maximum, making them effectively ineffective. This one was easy. We
    got rid of two pretty much redundant knobs. Nice!

  • The RLIMIT_NOFILE hard limit is bumped substantially to 512K. Yay,
    cheap fds! You may have an fd, and you, and you as well,
    everyone may have an fd!

  • But … we left the soft RLIMIT_NOFILE limit at 1024. We weren’t
    quite ready to break all programs still using select() in 2019
    yet. But it’s not as bad as it might sound I think: given the hard
    limit is bumped every program can easily opt-in to a larger number
    of fds, by setting the soft limit to the hard limit early on —
    without requiring privileges.

So effectively, with this approach fds should be much less scarce (at
least for programs that opt into that), and the limits should be much
easier to configure, since there are only two knobs now one really
needs to care about:

  • Configure the RLIMIT_NOFILE hard limit to the maximum number of
    fds you actually want to allow a process.

  • In the program code then either bump the soft to the hard limit, or
    not. If you do, you basically declare “I understood the problem, I
    promise to not use select(), drown me in fds please!”. If you don’t
    then effectively everything remains as it always was.

Apparently this approach worked, since the negative feedback on the
change was even scarcer than fds traditionally were (ha, fun!). We got
reports from pretty much only two projects that were bitten by the
change (one being a JVM implementation): they already bumped their
soft limit automatically to their hard limit during program
initialization, and then allocated an array with one entry per
possible fd. With the new high limit this resulted in one massive
allocation that traditionally was just a few K, and this caused memory
checks to be hit.

Anyway, here’s the take away of this blog story:

  • Don’t use select() anymore in 2021. Use poll(), epoll,
    io_uring, …, but for heaven’s sake don’t use select(). It might
    have been all the rage in the 1990s but it doesn’t scale and is
    simply not designed for today’s programs. I wish the man page of
    select() would make clearer how icky it is and that there are
    plenty of preferable APIs.

  • If you hack on a program that potentially uses a lot of fds, add
    some simple code somewhere to its start-up that bumps the
    RLIMIT_NOFILE soft limit to the hard limit (a sketch of such code
    follows after this list). But if you do this, you have to make sure
    your code (and any code that you link to from it) refrains from
    using select(). (Note: there’s at least one glibc NSS plugin using
    select() internally. Given that NSS modules can end up being
    loaded into pretty much any process such modules should probably
    be considered just buggy.)

  • If said program you hack on forks off foreign programs, make sure to
    reset the RLIMIT_NOFILE soft limit back to 1024 for them. Just
    because your program might be fine with fds >= 1024 it doesn’t mean
    that those foreign programs will be. And unfortunately
    RLIMIT_NOFILE is inherited down the process tree unless explicitly
    set.
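
Here’s a minimal sketch, in C, of what such start-up code could look
like (error handling kept to the bare minimum; adapt as needed):

#include <sys/resource.h>

/* Bump the RLIMIT_NOFILE soft limit to the hard limit. Only call this if
 * you are sure neither your own code nor any library you link against
 * still uses select(). */
static int bump_rlimit_nofile(void) {
        struct rlimit rl;

        if (getrlimit(RLIMIT_NOFILE, &rl) < 0)
                return -1;

        rl.rlim_cur = rl.rlim_max;
        return setrlimit(RLIMIT_NOFILE, &rl);
}

/* Before fork()+exec()ing foreign programs, consider resetting the soft
 * limit back to 1024, since they might not expect fds >= 1024. */
static int reset_rlimit_nofile(void) {
        struct rlimit rl;

        if (getrlimit(RLIMIT_NOFILE, &rl) < 0)
                return -1;

        if (rl.rlim_max < 1024)
                return 0; /* hard limit is already lower, nothing to do */

        rl.rlim_cur = 1024;
        return setrlimit(RLIMIT_NOFILE, &rl);
}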

And that’s all I have for today. I hope this was enlightening.

Unlocking LUKS2 volumes with TPM2, FIDO2, PKCS#11 Security Hardware on systemd 248

Post Syndicated from original http://0pointer.net/blog/unlocking-luks2-volumes-with-tpm2-fido2-pkcs11-security-hardware-on-systemd-248.html

TL;DR: It’s now easy to unlock your LUKS2 volume with a FIDO2
security token (e.g. YubiKey, Nitrokey FIDO2, AuthenTrend
ATKey.Pro). And TPM2 unlocking is easy now too.

Blogging is a lot of work, and a lot less fun than hacking. I mostly
focus on the latter because of that, but from time to time I guess
stuff is just too interesting to not be blogged about. Hence here,
finally, another blog story about exciting new features in systemd.

With the upcoming systemd v248 the
systemd-cryptsetup
component of systemd (which is responsible for assembling encrypted
volumes during boot) gained direct support for unlocking encrypted
storage with three types of security hardware:

  1. Unlocking with FIDO2 security tokens (well, at least with those
    which implement the hmac-secret extension; most do). i.e. your
    YubiKeys (series 5 and above), Nitrokey FIDO2, AuthenTrend
    ATKey.Pro and such.

  2. Unlocking with TPM2 security chips (pretty ubiquitous on non-budget
    PCs/laptops/…)

  3. Unlocking with PKCS#11 security tokens, i.e. your smartcards and
    older YubiKeys (the ones that implement PIV). (Strictly speaking
    this was supported on older systemd already, but was a lot more
    “manual”.)

For completeness’ sake, let’s keep in mind that the component also
allows unlocking with these more traditional mechanisms:

  1. Unlocking interactively with a user-entered passphrase (i.e. the
    way most people probably already deploy it, supported since
    about forever)

  2. Unlocking via key file on disk (optionally on removable media
    plugged in at boot), supported since forever.

  3. Unlocking via a key acquired through trivial
    AF_UNIX/SOCK_STREAM socket IPC. (Also new in v248)

  4. Unlocking via recovery keys. These are pretty much the same
    thing as a regular passphrase (and in fact can be entered wherever
    a passphrase is requested) — the main difference being that they
    are always generated by the computer, and thus have guaranteed high
    entropy, typically higher than user-chosen passphrases. They are
    generated in a way they are easy to type, in many cases even if the
    local key map is misconfigured. (Also new in v248)

In this blog story, let’s focus on the first three items, i.e. those
that talk to specific types of hardware for implementing unlocking.

To make working with security tokens and TPM2 easy, a new, small tool
was added to the systemd tool set:
systemd-cryptenroll. Its
only purpose is to make it easy to enroll your security token/chip of
choice into an encrypted volume. It works with any LUKS2 volume, and
embeds a tiny bit of meta-information into the LUKS2 header with
parameters necessary for the unlock operation.

Unlocking with FIDO2

So, let’s see how this fits together in the FIDO2 case. Most likely
this is what you want to use if you have one of these fancy FIDO2 tokens
(which need to implement the hmac-secret extension, as
mentioned). Let’s say you already have your LUKS2 volume set up, and
previously unlocked it with a simple passphrase. Plug in your token,
and run:

# systemd-cryptenroll --fido2-device=auto /dev/sda5

(Replace /dev/sda5 with the underlying block device of your volume).

This will enroll the key as an additional way to unlock the volume,
and embeds all necessary information for it in the LUKS2 volume
header. Before we can unlock the volume with this at boot, we need to
allow FIDO2 unlocking via
/etc/crypttab. For
that, find the right entry for your volume in that file, and edit it
like so:

myvolume /dev/sda5 - fido2-device=auto

Replace myvolume and /dev/sda5 with the right volume name, and
underlying device of course. Key here is the fido2-device=auto
option you need to add to the fourth column in the file. It tells
systemd-cryptsetup to use the FIDO2 metadata now embedded in the
LUKS2 header, wait for the FIDO2 token to be plugged in at boot
(utilizing systemd-udevd, …) and unlock the volume with it.

And that’s it already. Easy-peasy, no?

Note that all of this doesn’t modify the FIDO2 token itself in any
way. Moreover you can enroll the same token in as many volumes as you
like. Since all enrollment information is stored in the LUKS2 header
(and not on the token) there are no bounds on any of this. (OK, well,
admittedly, there’s a cap on LUKS2 key slots per volume, i.e. you
can’t enroll more than a bunch of keys per volume.)

Unlocking with PKCS#11

Let’s now have a closer look how the same works with a PKCS#11
compatible security token or smartcard. For this to work, you need a
device that can store an RSA key pair. I figure most security
tokens/smartcards that implement PIV qualify. How you actually get the
keys onto the device might differ though. Here’s how you do this for
any YubiKey that implements the PIV feature:

# ykman piv reset
# ykman piv generate-key -a RSA2048 9d pubkey.pem
# ykman piv generate-certificate --subject "Knobelei" 9d pubkey.pem
# rm pubkey.pem

(This chain of commands erases what was stored in the PIV feature of
your token before, so be careful!)

For tokens/smartcards from other vendors a different series of
commands might work. Once you have a key pair on it, you can enroll it
with a LUKS2 volume like so:

# systemd-cryptenroll --pkcs11-token-uri=auto /dev/sda5

Just like the same command’s invocation in the FIDO2 case this enrolls
the security token as an additional way to unlock the volume, any
passphrases you already have enrolled remain enrolled.

For the PKCS#11 case you need to edit your /etc/crypttab entry like this:

myvolume /dev/sda5 - pkcs11-uri=auto

If you have a security token that implements both PKCS#11 PIV and
FIDO2 I’d probably enroll it as a FIDO2 device, given it’s the more
contemporary, future-proof standard. Moreover, it requires no special
preparation in order to get an RSA key onto the device: FIDO2 keys
typically just work.

Unlocking with TPM2

Most modern (non-budget) PC hardware (and other kind of hardware too)
nowadays comes with a TPM2 security chip. In many ways a TPM2 chip is
a smartcard that is soldered onto the mainboard of your system. Unlike
your usual USB-connected security tokens you thus cannot remove them
from your PC, which means they address quite a different security
scenario: they aren’t immediately comparable to a physical key you can
take with you that unlocks some door, but rather a key you leave at
the door that refuses to be turned by anyone but you.

Even though this sounds a lot weaker than the FIDO2/PKCS#11 model, TPM2
chips still bring benefits for securing your systems: because the
cryptographic key material stored in TPM2 devices cannot be extracted
(at least that’s the theory), if you bind your hard disk encryption to
it, it means attackers cannot just copy your disk and analyze it
offline — they always need access to the TPM2 chip too to have a
chance to acquire the necessary cryptographic keys. Thus, they can
still steal your whole PC and analyze it, but they cannot just copy
the disk without you noticing and analyze the copy.

Moreover, you can bind the ability to unlock the harddisk to specific
software versions: for example you could say that only your trusted
Fedora Linux can unlock the device, but not any arbitrary OS some
hacker might boot from a USB stick they plugged in. Thus, if you trust
your OS vendor, you can entrust storage unlocking to the vendor’s OS
together with your TPM2 device, and thus can be reasonably sure
intruders cannot decrypt your data unless they both hack your OS
vendor and steal/break your TPM2 chip.

Here’s how you enroll your LUKS2 volume with your TPM2 chip:

# systemd-cryptenroll --tpm2-device=auto --tpm2-pcrs=7 /dev/sda5

This looks almost as straightforward as the two earlier
systemd-cryptenroll command lines — if it wasn’t for the
--tpm2-pcrs= part. With that option you can specify to which TPM2
PCRs you want to bind the enrollment. TPM2 PCRs are a set of
(typically 24) hash values that every TPM2 equipped system at boot
calculates from all the software that is invoked during the boot
sequence, in a secure, unfakable way (this is called
“measurement”). If you bind unlocking to a specific value of a
specific PCR you thus require the system has to follow the same
sequence of software at boot to re-acquire the disk encryption
key. Sounds complex? Well, that’s because it is.

For now, let’s see how we have to modify your /etc/crypttab to
unlock via TPM2:

myvolume /dev/sda5 - tpm2-device=auto

This part is easy again: the tpm2-device= option is what tells
systemd-cryptsetup to use the TPM2 metadata from the LUKS2 header
and to wait for the TPM2 device to show up.

Bonus: Recovery Key Enrollment

FIDO2, PKCS#11 and TPM2 security tokens and chips pair well with
recovery keys: since you don’t need to type in your password everyday
anymore it makes sense to get rid of it, and instead enroll a
high-entropy recovery key you then print out or scan off screen and
store in a safe, physical location. I.e. forget about good ol’
passphrase-based unlocking, go for FIDO2 plus recovery key instead!
Here’s how you do it:

# systemd-cryptenroll --recovery-key /dev/sda5

This will generate a key, enroll it in the LUKS2 volume, show it to
you on screen and generate a QR code you may scan off screen if you
like. The key has highest entropy, and can be entered wherever you can
enter a passphrase. Because of that you don’t have to modify
/etc/crypttab to make the recovery key work.
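
If you do indeed want to drop the traditional passphrase after
enrolling a recovery key, a sufficiently recent systemd-cryptenroll can
also remove that key slot again; something like this should do it:

# systemd-cryptenroll --wipe-slot=password /dev/sda5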

Future

There’s still plenty of room for further improvement in all of this. In
particular for the TPM2 case: what the text above doesn’t really
mention is that binding your encrypted volume unlocking to specific
software versions (i.e. kernel + initrd + OS versions) actually sucks
hard: if you naively update your system to newer versions you might
lose access to your TPM2 enrolled keys (which isn’t terrible, after
all you did enroll a recovery key — right? — which you then can use
to regain access). To solve this some more integration with
distributions would be necessary: whenever they upgrade the system
they’d have to make sure to enroll the TPM2 again — with the PCR
hashes matching the new version. And whenever they remove an old
version of the system they need to remove the old TPM2
enrollment. Alternatively TPM2 also knows a concept of signed PCR
hash values. In this mode the distro could just ship a set of PCR
signatures which would unlock the TPM2 keys. (But quite frankly I
don’t really see the point: whether you drop in a signature file on
each system update, or enroll a new set of PCR hashes in the LUKS2
header doesn’t make much of a difference). Either way, to make TPM2
enrollment smooth some more integration work with your distribution’s
system update mechanisms needs to happen. And yes, because of this OS
updating complexity the example above — where I referenced your trusty
Fedora Linux — doesn’t actually work IRL (yet? hopefully…). Nothing
updates the enrollment automatically after you initially enrolled it,
hence after the first kernel/initrd update you have to manually
re-enroll things again, and again, and again … after every update.

The TPM2 could also be used for other kinds of key policies, which we
might look into adding later too. For example, Windows uses TPM2 stuff to
allow short (4 digits or so) “PINs” for unlocking the harddisk,
i.e. kind of a low-entropy password you type in. The reason this is
reasonably safe is that in this case the PIN is passed to the TPM2
which enforces that not more than some limited amount of unlock
attempts may be made within some time frame, and that after too many
attempts the PIN is invalidated altogether. Thus making dictionary
attacks harder (which would normally be easier given the short length
of the PINs).

Postscript

(BTW: Yubico sent me two YubiKeys for testing, Nitrokey a Nitrokey
FIDO2, and AuthenTrend three ATKey.Pro tokens, thank you! — That’s why
you see all those references to YubiKey/Nitrokey/AuthenTrend devices
in the text above: it’s the hardware I had to test this with. That
said, I also tested the FIDO2 stuff with a SoloKey I bought, where it
also worked fine. And yes, you!, other vendors!, who might be reading
this, please send me your security tokens for free, too, and I
might test things with them as well. No promises though. And I am not
going to give them back, if you do, sorry. ;-))

ASG! 2019 CfP Re-Opened!

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/asg-2019-cfp-re-opened.html

The All Systems Go! 2019 Call for Participation Re-Opened for ONE DAY!

Due to popular request we have re-opened the Call for Participation
(CFP) for All Systems Go! 2019 for one
day. It will close again TODAY, on the 15th of July 2019, midnight Central
European Summer Time! If you missed the deadline so far, we’d like to
invite you to submit your proposals for consideration to the CFP
submission site
quickly!
(And yes, this is the last extension, there’s not going to be any
more extensions.)

ASG image

All Systems Go! is everybody’s favourite low-level Userspace Linux
conference, taking place in Berlin, Germany in September 20-22, 2019.

For more information please visit our conference
website!

Walkthrough for Portable Services in Go

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/walkthrough-for-portable-services-in-go.html

Portable Services Walkthrough (Go Edition)

A few months ago I posted a blog story with a walkthrough of systemd
Portable Services. The
example service given was written in C, and the image was built with
mkosi. In this blog story I’d
like to revisit the exercise, but this time focus on a different
aspect: modern programming languages like Go and Rust push users a lot
more towards static linking of libraries than the usual dynamic
linking preferred by C (at least in the way C is used by traditional
Linux distributions).

Static linking means we can greatly simplify image building: if we
don’t have to link against shared libraries during runtime we don’t
have to include them in the portable service image. And that means
pretty much all need for building an image from a Linux distribution
of some kind goes away as we’ll have next to no dependencies that
would require us to rely on a distribution package manager or
distribution packages. In fact, as it turns out, we only need as few
as three files in the portable service image to be fully functional.

So, let’s have a closer look how such an image can be put
together. All of the following is available in this git
repository
.

A Simple Go Service

Let’s start with a simple Go service, an HTTP service that simply
counts how often a page from it is requested. Here are the sources:
main.go
— note that I am not a seasoned Go programmer, hence please be
gracious.

The service implements systemd’s socket activation protocol, and thus
can receive bound TCP listener sockets from systemd, using the
$LISTEN_PID and $LISTEN_FDS environment variables.

The service will store the counter data in the directory indicated in
the $STATE_DIRECTORY environment variable, which happens to be an
environment variable current systemd versions set based on the
StateDirectory=
setting in service files.

Two Simple Unit Files

When a service shall be managed by systemd a unit file is
required. Since the service we are putting together shall be socket
activatable, we even have two:
portable-walkthrough-go.service
(the description of the service binary itself) and
portable-walkthrough-go.socket
(the description of the sockets to listen on for the service).

These units are not particularly remarkable: the .service file
primarily contains the command line to invoke and a StateDirectory=
setting to make sure the service when invoked gets its own private
state directory under /var/lib/ (and the $STATE_DIRECTORY
environment variable is set to the resulting path). The .socket file
simply lists 8088 as TCP/IP port to listen on.
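
To give you a rough idea what these two units contain, here’s a sketch
(not the literal files from the repository; in particular the binary
path is just an assumption):

# portable-walkthrough-go.service
[Unit]
Description=A simple Go example service

[Service]
ExecStart=/usr/local/lib/portable-walkthrough-go/portable-walkthrough-go
StateDirectory=portable-walkthrough-go

# portable-walkthrough-go.socket
[Unit]
Description=Listening socket for the Go example service

[Socket]
ListenStream=8088

[Install]
WantedBy=sockets.target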

An OS Description File

OS images (and that includes portable service images) generally should
include an
os-release
file. Usually, that is provided by the distribution. Since we are
building an image without any distribution let’s write our own
version of such a
file
. Later
on we can use the portablectl inspect command to have a look at this
metadata of our image.
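
Such a hand-written os-release file can be quite short; a sketch of
what it might contain:

ID=portable-walkthrough-go
PRETTY_NAME=Portable Walkthrough Go Service
VERSION_ID=1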

Putting it All Together

The four files described above are already every file we need to build
our image. Let’s now put the portable service image together. For that
I’ve written a
Makefile. It
contains two relevant rules: the first one builds the static binary
from the Go program sources. The second one then puts together a
squashfs file system combining the following:

  1. The compiled, statically linked service binary
  2. The two systemd unit files
  3. The os-release file
  4. A couple of empty directories such as /proc/, /sys/, /dev/
    and so on that need to be over-mounted with the respective kernel
    API file system. We need to create them as empty directories here
    since Linux insists on directories to exist in order to over-mount
    them, and since the image we are building is going to be an
    immutable read-only image (squashfs) these directories cannot be
    created dynamically when the portable image is mounted.
  5. Two empty files /etc/resolv.conf and /etc/machine-id that can
    be over-mounted with the same files from the host.

And that’s already it. After a quick make we’ll have our portable
service image portable-walkthrough-go.raw and are ready to go.
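
In case you’d rather not read the Makefile: the packaging step
essentially boils down to a single mksquashfs invocation over a staging
directory laid out as described above, roughly like this (directory
name and options are just an example):

$ mksquashfs rootfs/ portable-walkthrough-go.raw -noappend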

Trying it out

Let’s now attach the portable service image to our host system:

# portablectl attach ./portable-walkthrough-go.raw
(Matching unit files with prefix 'portable-walkthrough-go'.)
Created directory /etc/systemd/system.attached.
Created directory /etc/systemd/system.attached/portable-walkthrough-go.socket.d.
Written /etc/systemd/system.attached/portable-walkthrough-go.socket.d/20-portable.conf.
Copied /etc/systemd/system.attached/portable-walkthrough-go.socket.
Created directory /etc/systemd/system.attached/portable-walkthrough-go.service.d.
Written /etc/systemd/system.attached/portable-walkthrough-go.service.d/20-portable.conf.
Created symlink /etc/systemd/system.attached/portable-walkthrough-go.service.d/10-profile.conf → /usr/lib/systemd/portable/profile/default/service.conf.
Copied /etc/systemd/system.attached/portable-walkthrough-go.service.
Created symlink /etc/portables/portable-walkthrough-go.raw → /home/lennart/projects/portable-walkthrough-go/portable-walkthrough-go.raw.

The portable service image is now attached to the host, which means we
can now go and start it (or even enable it):

# systemctl start portable-walkthrough-go.socket

Let’s see if our little web service works, by doing an HTTP request on port 8088:

# curl localhost:8088
Hello! You are visitor #1!

Let’s try this again, to check if it counts correctly:

# curl localhost:8088
Hello! You are visitor #2!

Nice! It worked. Let’s now stop the service again, and detach the image again:

# systemctl stop portable-walkthrough-go.service portable-walkthrough-go.socket
# portablectl detach portable-walkthrough-go
Removed /etc/systemd/system.attached/portable-walkthrough-go.service.
Removed /etc/systemd/system.attached/portable-walkthrough-go.service.d/10-profile.conf.
Removed /etc/systemd/system.attached/portable-walkthrough-go.service.d/20-portable.conf.
Removed /etc/systemd/system.attached/portable-walkthrough-go.service.d.
Removed /etc/systemd/system.attached/portable-walkthrough-go.socket.
Removed /etc/systemd/system.attached/portable-walkthrough-go.socket.d/20-portable.conf.
Removed /etc/systemd/system.attached/portable-walkthrough-go.socket.d.
Removed /etc/portables/portable-walkthrough-go.raw.
Removed /etc/systemd/system.attached.

And there we go, the portable image file is detached from the host again.

A Couple of Notes

  1. Of course, this is a simplistic example: in real life services will
    be more than one compiled file, even when statically linked. But
    you get the idea, and it’s very easy to extend the example above to
    include any additional, auxiliary files in the portable service
    image.

  2. The service is very nicely sandboxed during runtime: while it runs
    as regular service on the host (and you thus can watch its logs or
    do resource management on it like you would do for all other
    systemd services), it runs in a very restricted environment under a
    dynamically assigned UID that ceases to exist when the service is
    stopped again.

  3. Originally I wanted to make the service not only socket activatable
    but also implement exit-on-idle, i.e. add a logic so that the
    service terminates on its own when there’s no ongoing HTTP
    connection for a while. I couldn’t figure out how to do this
    race-freely in Go though, but I am sure an interested reader might
    want to add that? By combining socket activation with exit-on-idle
    we can turn this project into an exercise of putting together an
    extremely resource-friendly and robust service architecture: the
    service is started only when needed and terminates when no longer
    needed. This would make it possible to pack services at a much
    higher density even on systems with few resources.

  4. While the basic concepts of portable services have been around
    since systemd 239, it’s best to try the above with systemd 241 or
    newer since the portable service logic received a number of fixes
    since then.

Further Reading

A low-level document introducing Portable Services is shipped along
with systemd.

Please have a look at the blog story from a few months
ago that did something very similar with a service written in C.

There are also relevant manual pages:
portablectl(1)
and
systemd-portabled(8).

ASG! 2018 Tickets

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/asg-2018-tickets.html

All Systems Go! 2018 Tickets Selling Out Quickly!

Buy your tickets for All Systems Go!
2018
soon, they are quickly selling out!
The conference takes place on September 28-30, in Berlin, Germany, in
a bit over two weeks.

Why should you attend? If you are interested in low-level Linux
userspace, then All Systems Go! is the right conference for you. It
covers all topics relevant to foundational open-source Linux
technologies. For details on the covered topics see our schedule for day #1
and for day #2.

For more information please visit our conference
website!

See you in Berlin!

ASG! 2018 CfP Closes TODAY

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/asg-2018-cfp-closes-today.html

The All Systems Go! 2018 Call for Participation Closes TODAY!

The Call for Participation (CFP) for All Systems Go!
2018
will close TODAY, on the 30th of
July! We’d like to invite you to submit your proposals for
consideration to the CFP submission
site
quickly!

ASG image

All Systems Go! is everybody’s favourite low-level Userspace Linux
conference, taking place in Berlin, Germany in September 28-30, 2018.

For more information please visit our conference
website!

ASG! 2018 CfP Closes Soon

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/asg-2018-cfp-closes-soon.html

The All Systems Go! 2018 Call for Participation Closes in One Week!

The Call for Participation (CFP) for All Systems Go!
2018
will close in one week, on the 30th of
July! We’d like to invite you to submit your proposals for
consideration to the CFP submission
site
quickly!

ASG image

Notification of acceptance and non-acceptance will go out within 7
days of the closing of the CFP.

All topics relevant to foundational open-source Linux technologies are
welcome. In particular, however, we are looking for proposals
including, but not limited to, the following topics:

  • Low-level container executors and infrastructure
  • IoT and embedded OS infrastructure
  • BPF and eBPF filtering
  • OS, container, IoT image delivery and updating
  • Building Linux devices and applications
  • Low-level desktop technologies
  • Networking
  • System and service management
  • Tracing and performance measuring
  • IPC and RPC systems
  • Security and Sandboxing

While our focus is definitely more on the user-space side of things,
talks about kernel projects are welcome, as long as they have a clear
and direct relevance for user-space.

For more information please visit our conference
website!

Walkthrough for Portable Services

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/walkthrough-for-portable-services.html

Portable Services with systemd v239

systemd v239
contains a great number of new features. One of them is first class
support for Portable Services. In this blog story
I’d like to shed some light on what they are and why they might be
interesting for your application.

What are “Portable Services”?

The “Portable Service” concept takes inspiration from classic
chroot() environments as well as container management and brings a
number of their features to more regular system service management.

While the definition of what a “container” really is is hotly debated,
I figure people can generally agree that the “container” concept
primarily provides two major features:

  1. Resource bundling: a container generally brings its own file system
    tree along, bundling any shared libraries and other resources it
    might need along with the main service executables.

  2. Isolation and sand-boxing: a container operates in a name-spaced
    environment that is relatively detached from the host. Besides
    living in its own file system namespace it usually also has its own
    user database, process tree and so on. Access from the container to
    the host is limited with various security technologies.

Of these two concepts the first one is also what traditional UNIX
chroot() environments are about.

Both resource bundling and isolation/sand-boxing are concepts systemd
has implemented to varying degrees for a longer time. Specifically,
RootDirectory=
and
RootImage=
have been around for a long time, and so have been the various
sand-boxing
features systemd provides. The Portable Services concept builds on that,
putting these features together in a new, integrated way to make them
more accessible and usable.

OK, so what precisely is a “Portable Service”?

Much like a container image, a portable service on disk can be just a
directory tree that contains service executables and all their
dependencies, in a hierarchy resembling the normal Linux directory
hierarchy. A portable service can also be a raw disk image, containing
a file system containing such a tree (which can be mounted via a
loop-back block device), or multiple file systems (in which case they
need to follow the Discoverable Partitions
Specification and be located within a GPT partition table). Regardless of whether the
portable service on disk is a simple directory tree or a raw disk
image, let’s call this concept the portable service image.

Such images can be generated with any tool typically used for the
purpose of installing OSes inside some directory, for example dnf
--installroot=
or debootstrap. There are very few requirements made
on these trees, except the following two:

  1. The tree should carry systemd unit
    files for relevant services in them.

  2. The tree should carry
    /usr/lib/os-release
    (or /etc/os-release) OS release information.

Of course, as you might notice, OS trees generated from any of today’s
big distributions generally qualify for these two requirements without
any further modification, as pretty much all of them adopted
/usr/lib/os-release and tend to ship their major services with
systemd unit files.

A portable service image generated like this can be “attached” or
“detached” from a host:

  1. “Attaching” an image to a host is done through the new
    portablectl attach
    command. This command dissects the image, reading the os-release
    information, and searching for unit files in them. It then copies
    relevant unit files out of the images and into
    /etc/systemd/system/. After that it augments any copied service
    unit files in two ways: a drop-in adding a RootDirectory= or
    RootImage= line is added in, so that even though the unit files
    are now available on the host, when started they run the referenced
    binaries from the image. It also symlinks in a second drop-in which
    is called a “profile”, which is supposed to carry additional
    security settings to enforce on the attached services, to ensure
    the right amount of sand-boxing.

  2. “Detaching” an image from the host is done through portablectl
    detach. It reverses the steps above: the unit files copied out are
    removed again, and so are the two drop-in files generated for them.

While a portable service is attached its relevant unit files are made
available on the host like any others: they will appear in systemctl
list-unit-files, you can enable and disable them, you can start them
and stop them. You can extend them with systemctl edit. You can
introspect them. You can apply resource management to them like to any
other service, and you can process their logs like any other service
and so on. That’s because they really are native systemd services,
except that they have a ‘twist’ if you so will: they have tougher
security by default and store their resources in a root directory or
image.

And that’s already the essence of what Portable Services are.

A couple of interesting points:

  1. Even though the focus is on shipping service unit files in
    portable service images, you can actually ship timer units, socket
    units, target units, path units in portable services too. This
    means you can very naturally do time, socket and path based
    activation. It’s also entirely fine to ship multiple service units
    in the same image, in case you have more complex applications.

  2. This concept introduces zero new metadata. Unit files are an
    existing concept, as are os-release files, and — in case you opt
    for raw disk images — GPT partition tables are already established
    too. This also means existing tools to generate images can be
    reused for building portable service images to a large degree as no
    completely new artifact types need to be generated.

  3. Because the Portable Service concept introduces zero new metadata
    and just builds on existing security and resource bundling
    features of systemd it’s implemented in a set of distinct tools,
    relatively disconnected from the rest of systemd. Specifically, the
    main user-facing command is
    portablectl,
    and the actual operations are implemented in
    systemd-portabled.service. If
    you so will, portable services are a true add-on to systemd, just
    making a specific work-flow nicer to use than with the basic
    operations systemd otherwise provides. Also note that
    systemd-portabled provides bus APIs accessible to any program
    that wants to interface with it, portablectl is just one tool
    that happens to be shipped along with systemd.

  4. Since Portable Services are a feature we only added very recently
    we wanted to keep some freedom to make changes still. Due to that
    we decided to install the portablectl command into
    /usr/lib/systemd/ for now, so that it does not appear in $PATH
    by default. This means, for now you have to invoke it with a full
    path: /usr/lib/systemd/portablectl. We expect to move it into
    /usr/bin/ very soon though, and make it a fully supported
    interface of systemd.

  5. You may wonder which unit files contained in a portable service
    image are the ones considered “relevant” and are actually copied
    out by the portablectl attach operation. Currently, this is
    derived from the image name. Let’s say you have an image stored in
    a directory /var/lib/portables/foobar_4711/ (or alternatively in
    a raw image /var/lib/portables/foobar_4711.raw). In that case the
    unit files copied out match the pattern foobar*.service,
    foobar*.socket, foobar*.target, foobar*.path,
    foobar*.timer.

  6. The Portable Services concept does not define any specific method
    how images get on the deployment machines, that’s entirely up to
    administrators. You can just scp them there, or wget them. You
    could even package them as RPMs and then deploy them with dnf if
    you feel adventurous.

  7. Portable service images can reside in any directory you
    like. However, if you place them in /var/lib/portables/ then
    portablectl will find them easily and can show you a list of
    images you can attach and suchlike.

  8. Attaching a portable service image can be done persistently, so
    that it remains attached on subsequent boots (which is the default),
    or it can be attached only until the next reboot, by passing
    --runtime to portablectl.

  9. Because portable service images are ultimately just regular OS
    images, it’s natural and easy to build a single image that can be
    used in three different ways:

    1. It can be attached to any host as a portable service image.

    2. It can be booted as OS container, for example in a container
      manager like systemd-nspawn.

    3. It can be booted as host system, for example on bare metal or
      in a VM manager.

    Of course, to qualify for the latter two the image needs to
    contain more than just the service binaries, the os-release file
    and the unit files. To be bootable an OS container manager such as
    systemd-nspawn the image needs to contain an init system of some
    form, for example
    systemd. To
    be bootable on bare metal or as VM it also needs a boot loader of
    some form, for example
    systemd-boot.

Profiles

In the previous section the “profile” concept was briefly
mentioned. Since they are a major feature of the Portable Services
concept, they deserve some focus. A “profile” is ultimately just a
pre-defined drop-in file for unit files that are attached to a
host. They are supposed to mostly contain sand-boxing and security
settings, but may actually contain any other settings, too. When a
portable service is attached a suitable profile has to be selected. If
none is selected explicitly, the default profile called default is
used. systemd ships with four different profiles out of the box:

  1. The
    default
    profile provides a medium level of security. It contains settings to
    drop capabilities, enforce system call filters, restrict many kernel
    interfaces and mount various file systems read-only.

  2. The
    strict
    profile is similar to the default profile, but generally uses the
    most restrictive sand-boxing settings. For example networking is turned
    off and access to AF_NETLINK sockets is prohibited.

  3. The
    trusted
    profile is the least strict of them all. In fact it makes almost no
    restrictions at all. A service run with this profile has basically
    full access to the host system.

  4. The
    nonetwork
    profile is mostly identical to default, but also turns off network access.

Note that the profile is selected at the time the portable service
image is attached, and it applies to all service files attached, in
case multiple are shipped in the same image. Thus, the sand-boxing
restrictions to enforce are selected by the administrator attaching the
image and not the image vendor.
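
For example, attaching an image with the strict profile instead of the
default one should be as simple as this (assuming your portablectl
version supports the --profile= switch):

# /usr/lib/systemd/portablectl attach --profile=strict ./foobar_4711.raw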

Additional profiles can be defined easily by the administrator, if
needed. We might also add additional profiles sooner or later to be
shipped with systemd out of the box.

What’s the use-case for this? If I have containers, why should I bother?

Portable Services are primarily intended to cover use-cases where code
should more feel like “extensions” to the host system rather than live
in disconnected, separate worlds. The profile concept is
supposed to be tunable to the exact right amount of integration or
isolation needed for an application.

In the container world the concept of “super-privileged containers”
has been touted a lot, i.e. containers that run with full
privileges. It’s precisely that use-case that portable services are
intended for: extensions to the host OS, that default to isolation,
but can optionally get as much access to the host as needed, and can
naturally take benefit of the full functionality of the host. The
concept should hence be useful for all kinds of low-level system
software that isn’t shipped with the OS itself but needs varying
degrees of integration with it. Besides servers and appliances this
should be particularly interesting for IoT and embedded devices.

Because portable services are just a relatively small extension to the
way system services are otherwise managed, they can be treated like
regular services for almost all use-cases: they will appear along
regular services in all tools that can introspect systemd unit data,
and can be managed the same way when it comes to logging, resource
management, runtime life-cycles and so on.

Portable services are a very generic concept. While the original
use-case is OS extensions, it’s of course entirely up to you and other
users to use them in a suitable way of your choice.

Walkthrough

Let’s have a look how this all can be used. We’ll start with building
a portable service image from scratch, before we attach, enable and
start it on a host.

Building a Portable Service image

As mentioned, you can use any tool you like that can create OS trees
or raw images for building Portable Service images, for example
debootstrap or dnf --installroot=. For this example walkthrough
run we’ll use mkosi, which is
ultimately just a fancy wrapper around dnf and debootstrap but
makes a number of things particularly easy when repetitively building
images from source trees.

I have pushed everything necessary to reproduce this walkthrough
locally to a GitHub
repository
. Let’s check it out:

$ git clone https://github.com/systemd/portable-walkthrough.git

Let’s have a look in the repository:

  1. First of all,
    walkthroughd.c
    is the main source file of our little service. To keep things
    simple it’s written in C, but it could be in any language of your
    choice. The daemon as implemented won’t do much: it just starts up
    and waits for SIGTERM, at which point it will shut down. It’s
    ultimately useless, but hopefully illustrates how this all fits
    together. The C code has no dependencies besides libc.

  2. walkthroughd.service
    is a systemd unit file that starts our little daemon. It’s a simple
    service, hence the unit file is trivial.

  3. Makefile
    is a short make build script to build the daemon binary. It’s
    pretty trivial, too: it just takes the C file and builds a binary
    from it. It can also install the daemon: it places the binary in
    /usr/local/lib/walkthroughd/walkthroughd (why not in
    /usr/local/bin? Because it’s not a user-facing binary but a system
    service binary), and its unit file in
    /usr/local/lib/systemd/walkthroughd.service. If you want to test
    the daemon on the host, you can simply run make and then
    ./walkthroughd to check that everything works.

  4. mkosi.default
    is a file that tells mkosi how to build the image. We opt for a
    Fedora-based image here (but we might just as well have used
    Debian, or any other supported distribution). We need no particular
    packages at runtime (after all, we only depend on libc), but during
    the build phase we need gcc and make, hence these are the only
    packages we list in BuildPackages=.

  5. mkosi.build
    is a shell script that is invoked during mkosi’s build logic. All
    it does is invoke make and make install to build and install
    our little daemon, and afterwards it extends the
    distribution-supplied /etc/os-release file with an additional
    field that describes our portable service a bit.
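
To make this a bit more tangible, here’s a rough sketch of what such a
daemon could look like. This is not necessarily the exact code from the
repository, merely a minimal illustration of the idea: print a line,
block until SIGTERM (or SIGINT) arrives, then exit cleanly:

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
        sigset_t mask;
        int sig;

        printf("Initializing.\n");
        fflush(stdout); /* make sure the line reaches the journal right away */

        /* Block the termination signals so we can wait for them synchronously. */
        sigemptyset(&mask);
        sigaddset(&mask, SIGTERM);
        sigaddset(&mask, SIGINT);
        sigprocmask(SIG_BLOCK, &mask, NULL);

        /* Sleep until one of the blocked signals is delivered, then shut down. */
        sigwait(&mask, &sig);

        printf("Exiting.\n");
        return EXIT_SUCCESS;
}

The matching unit file can be equally minimal. Again, this is a sketch
of roughly what walkthroughd.service looks like, not necessarily a
verbatim copy of the file in the repository:

[Unit]
Description=A simple example service

[Service]
ExecStart=/usr/local/lib/walkthroughd/walkthroughd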

Let’s now use this to build the portable service image. For that we
use the mkosi tool. It’s
sufficient to invoke it without any parameters to build the first image:
it will automatically discover mkosi.default and mkosi.build, which
tell it what to do. (Note that if you work on a project like this for
a longer time, mkosi -if is probably the better command to use, as
it speeds up building substantially by using an incremental build
mode.) mkosi will download the necessary RPMs and put them all
together. It will build our little daemon inside the image, and after
all that’s done it will output the resulting image:
walkthroughd_1.raw.
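
Concretely, building the image thus boils down to something like the
following (depending on your environment you might need to run mkosi
as root):

$ cd portable-walkthrough
$ mkosi        # first build
$ mkosi -if    # later, incremental rebuilds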

Because we opted to build a GPT raw disk image in mkosi.default, this
file is actually a raw disk image containing a GPT partition
table. You can use fdisk -l walkthroughd_1.raw to list the partition
table, or systemd-nspawn -i walkthroughd_1.raw to explore the image
quickly if you need to.

Using the Portable Service Image

Now that we have a portable service image, let’s see how we can
attach, enable and start the service included within it.

First, let’s attach the image:

# /usr/lib/systemd/portablectl attach ./walkthroughd_1.raw
(Matching unit files with prefix 'walkthroughd'.)
Created directory /etc/systemd/system/walkthroughd.service.d.
Written /etc/systemd/system/walkthroughd.service.d/20-portable.conf.
Created symlink /etc/systemd/system/walkthroughd.service.d/10-profile.conf → /usr/lib/systemd/portable/profile/default/service.conf.
Copied /etc/systemd/system/walkthroughd.service.
Created symlink /etc/portables/walkthroughd_1.raw → /home/lennart/projects/portable-walkthrough/walkthroughd_1.raw.

The command shows exactly what it has been doing: it just copied the
main service file out and added the two drop-ins, as expected.

Let’s see if the unit is now available on the host, just like a regular unit, as promised:

# systemctl status walkthroughd.service
● walkthroughd.service - A simple example service
   Loaded: loaded (/etc/systemd/system/walkthroughd.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/walkthroughd.service.d
           └─10-profile.conf, 20-portable.conf
   Active: inactive (dead)

Nice, it worked. We see that the unit file is available and that
systemd correctly discovered the two drop-ins. The unit is neither
enabled nor started, however. Yes, attaching a portable service image
doesn’t imply enabling or starting it. It just means the unit files
contained in the image are made available to the host. It’s up to the
administrator to then enable them (so that they are automatically
started when needed, for example at boot), and/or start them (in case
they shall run right away).

Let’s now enable and start the service in one step:

# systemctl enable --now walkthroughd.service
Created symlink /etc/systemd/system/multi-user.target.wants/walkthroughd.service → /etc/systemd/system/walkthroughd.service.

Let’s check if it’s running:

# systemctl status walkthroughd.service
● walkthroughd.service - A simple example service
   Loaded: loaded (/etc/systemd/system/walkthroughd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/walkthroughd.service.d
           └─10-profile.conf, 20-portable.conf
   Active: active (running) since Wed 2018-06-27 17:55:30 CEST; 4s ago
 Main PID: 45003 (walkthroughd)
    Tasks: 1 (limit: 4915)
   Memory: 4.3M
   CGroup: /system.slice/walkthroughd.service
           └─45003 /usr/local/lib/walkthroughd/walkthroughd

Jun 27 17:55:30 sigma walkthroughd[45003]: Initializing.

Perfect! We can see that the service is now enabled and running. The daemon is running as PID 45003.

Now that we verified that all is good, let’s stop, disable and detach the service again:

# systemctl disable --now walkthroughd.service
Removed /etc/systemd/system/multi-user.target.wants/walkthroughd.service.
# /usr/lib/systemd/portablectl detach ./walkthroughd_1.raw
Removed /etc/systemd/system/walkthroughd.service.
Removed /etc/systemd/system/walkthroughd.service.d/10-profile.conf.
Removed /etc/systemd/system/walkthroughd.service.d/20-portable.conf.
Removed /etc/systemd/system/walkthroughd.service.d.
Removed /etc/portables/walkthroughd_1.raw.

And finally, let’s see that it’s really gone:

# systemctl status walkthroughd
Unit walkthroughd.service could not be found.

Perfect! It worked!

I hope the above gets you started with Portable Services. If you have
further questions, please contact our mailing
list.

Further Reading

A more low-level document explaining details is shipped
along with systemd.

There are also relevant manual pages:
portablectl(1)
and
systemd-portabled(8).

For further information about mkosi see its homepage.

All Systems Go! 2018 CfP Open

Post Syndicated from Lennart Poettering original https://0pointer.net/blog/all-systems-go-2018-cfp-open.html

The All Systems Go! 2018 Call for Participation is Now Open!

The Call for Participation (CFP) for All Systems Go!
2018
is now open. We’d like to invite you to submit your proposals for
consideration to the CFP submission
site.

The CFP will close on July 30th. Notification of acceptance and
non-acceptance will go out within 7 days of the closing of the CFP.

All topics relevant to foundational open-source Linux technologies are
welcome. In particular, however, we are looking for proposals
including, but not limited to, the following topics:

  • Low-level container executors and infrastructure
  • IoT and embedded OS infrastructure
  • BPF and eBPF filtering
  • OS, container, IoT image delivery and updating
  • Building Linux devices and applications
  • Low-level desktop technologies
  • Networking
  • System and service management
  • Tracing and performance measuring
  • IPC and RPC systems
  • Security and Sandboxing

While our focus is definitely more on the user-space side of things,
talks about kernel projects are welcome, as long as they have a clear
and direct relevance for user-space.

For more information please visit our conference
website!