Notes about virtual machine and container hosting.

  1. KVM bootstrap with libvirt
    1. Bridge configuration
    2. NAT configuration
    3. Base image build
      1. Autopkg builders
      2. Official images
    4. Virtual machine creation
      1. IP address discovery
    5. Maintenance
    6. Connecting to a remote libvirt instance
    7. Remaining tasks
    8. References
  2. Container notes
    1. Docker
      1. Restarting containers
    2. Rocket

This overlaps with my work on sbuild-qemu, which has its own way of provisioning virtual machines.

TODO: merge the above with this page.

KVM bootstrap with libvirt

I got tired of dealing with VirtualBox and Vagrant: those tools work well, but they are too far from datacenter-level hosting primitives, which right now converge towards KVM (or maybe Xen, but that doesn't seem to have recovered from the Meltdown attacks). VirtualBox was also not shipped in stretch because "upstream doesn't play in a really fair mode wrt CVEs" and simply ships updates in bulk.

So I started looking into KVM. A common way to get started with it, without setting up a whole cluster management system (e.g. Ganeti), seems to be libvirt. The instructions here also include bridge setup information for Debian stretch, since a bridge makes it easier to host services inside the virtual machines than a clunky NAT setup.

Bridge configuration

Assuming the local Ethernet interface is called eno1, the following configuration, in /etc/network/interfaces.d/br0, enables a bridge on the host:

iface eno1 inet manual

auto br0
iface br0 inet static
    # really necessary?
    #hwaddress ether f4:4d:30:66:14:9a
    address 192.168.0.7
    netmask 255.255.255.0
    gateway 192.168.0.1
    dns-nameservers 8.8.8.8

    bridge_ports eno1

iface br0 inet6 auto

Then disable other networking interfaces and enable the bridge:

ifdown eno1
service NetworkManager restart
ifup br0
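
To confirm the bridge actually came up, something like this should show br0 with its address, and eno1 enslaved to it:

ip -br addr show br0
bridge link show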

Finally, by default Linux bridges disable forwarding through the firewall. This works independently of the net.ipv[46].conf.all.forwarding setting, which should stay turned off unless we actually want to route packets for the network (as opposed to the guests). This can be tweaked by talking with iptables directly:

iptables -I FORWARD -m physdev --physdev-is-bridged -j ACCEPT

Or, preferably, by disabling the firewall on the bridge completely. This can be done by adding this to /etc/sysctl.d/br0-nf-disable.conf:

net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-arptables = 0

This was discovered in the libvirt wiki.
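
To apply those settings without a reboot, assuming the br_netfilter module is already loaded (otherwise those sysctls do not exist yet):

sysctl -p /etc/sysctl.d/br0-nf-disable.conf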

NAT configuration

The default configuration in libvirtd is a "NAT" configuration. That, in turn, injects firewall rules in the kernel when the "network" is started, to rewrite packets going in and out of the VM. dnsmasq is used for DNS and DHCP as well.

I had quite a battle with this network on my laptop, angela. At first nothing was getting through: IPv6 SLAAC configuration was working, but not DHCP. This hung the VM at boot, which led me to switch to systemd-networkd (see boot time optimizations). That didn't fix networking, but at least boot would no longer hang for a full minute while DHCP failed.

Then the fix was to add a subset of the Puppet module's NFT ruleset, through this commit:

class { 'nftables::rules::qemu':
  masquerade => false,
}

That created the following patch on the ruleset:

--- /etc/nftables/puppet/inet-filter-chain-default_in.nft   2023-11-28 15:47:58.143874297 -0500
+++ /tmp/puppet-file20231128-15717-utsqvt   2023-11-28 15:59:57.891321815 -0500
@@ -6,6 +6,12 @@
   ip6 nexthdr ipv6-icmp accept
 #   Start of fragment order:50 rulename:default_in-avahi_udp
   ip saddr { 0.0.0.0/0 } udp dport 5353 accept
+#   Start of fragment order:50 rulename:default_in-qemu_dhcpv4
+  iifname "virbr0" meta l4proto udp udp dport 67 accept
+#   Start of fragment order:50 rulename:default_in-qemu_tcp_dns
+  iifname "virbr0" tcp dport 53 accept
+#   Start of fragment order:50 rulename:default_in-qemu_udp_dns
+  iifname "virbr0" udp dport 53 accept
 #   Start of fragment order:50 rulename:default_in-ssh
   tcp dport {22} accept
 #   Start of fragment order:50 rulename:default_in-syncthing


--- /etc/nftables/puppet/inet-filter-chain-default_fwd.nft  2023-11-28 15:47:58.151874290 -0500
+++ /tmp/puppet-file20231128-15717-rv4jlv   2023-11-28 15:59:57.903321806 -0500
@@ -1,4 +1,10 @@
 # Start of fragment order:00 default_fwd header
 chain default_fwd {
+#   Start of fragment order:50 rulename:default_fwd-qemu_iip_v4
+  iifname "virbr0" ip saddr 192.168.122.0/24 accept
+#   Start of fragment order:50 rulename:default_fwd-qemu_io_internal
+  iifname "virbr0" oifname "virbr0" accept
+#   Start of fragment order:50 rulename:default_fwd-qemu_oip_v4
+  oifname "virbr0" ip daddr 192.168.122.0/24 ct state related,established accept
 # Start of fragment order:99 default_fwd footer
 }

Note that the network range matters here: it needs to match the one visible in the output of:

virsh net-dumpxml default
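
For reference, the stock default network definition looks something like this, with the 192.168.122.0/24 range:

<network>
  <name>default</name>
  <forward mode='nat'/>
  <bridge name='virbr0' stp='on' delay='0'/>
  <ip address='192.168.122.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.122.2' end='192.168.122.254'/>
    </dhcp>
  </ip>
</network>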

Also note that I previously included the nftables::rules::qemu class as is, but that broke libvirtd networking with this error:

error: internal error: Failed to apply firewall rules /usr/sbin/iptables -w --table nat --list-rules: # Warning: iptables-legacy tables present, use iptables-legacy to see them

The solution was to do the above masquerade => false. Or, in a diff:

--- /etc/nftables/puppet/ip-nat-chain-POSTROUTING.nft   2023-11-28 14:55:32.881506364 -0500
+++ /tmp/puppet-file20231128-9849-fc3war    2023-11-28 15:47:58.163874281 -0500
@@ -4,15 +4,5 @@
   type nat hook postrouting priority 100
 #   Start of fragment order:02 rulename:POSTROUTING-policy
   policy accept
-#   Start of fragment order:50 rulename:POSTROUTING-qemu_ignore_broadcast
-  ip saddr 192.168.122.0/24 ip daddr 255.255.255.255 return
-#   Start of fragment order:50 rulename:POSTROUTING-qemu_ignore_multicast
-  ip saddr 192.168.122.0/24 ip daddr 224.0.0.0/24 return
-#   Start of fragment order:50 rulename:POSTROUTING-qemu_masq_ip
-  ip saddr 192.168.122.0/24 ip daddr != 192.168.122.0/24 masquerade
-#   Start of fragment order:50 rulename:POSTROUTING-qemu_masq_tcp
-  meta l4proto tcp ip saddr 192.168.122.0/24 ip daddr != 192.168.122.0/24 masquerade to :1024-65535
-#   Start of fragment order:50 rulename:POSTROUTING-qemu_masq_udp
-  meta l4proto udp ip saddr 192.168.122.0/24 ip daddr != 192.168.122.0/24 masquerade to :1024-65535
 # Start of fragment order:99 POSTROUTING footer
 }

The DNS server distributed by dnsmasq also doesn't seem quite correct: the guest tries to reach 10.0.2.3 for some reason. I had to do this inside the guest for DNS to work:

echo nameserver 192.168.122.1 > /etc/resolv.conf

This might be solved by hardcoding a DNS server in systemd-networkd or elsewhere.
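
A sketch of what that override might look like in systemd-networkd, assuming the guest interface matches en* (the file name and match pattern here are made up, adjust to taste):

# /etc/systemd/network/50-dhcp.network
[Match]
Name=en*

[Network]
DHCP=yes
DNS=192.168.122.1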

IPv6 is configured by default, so if you're on an IPv4-only network, some problems are likely to occur. The fix is to edit the network and remove the <ip> block for IPv6:

service libvirtd stop
virsh net-destroy default
virsh net-edit default
virsh net-start default
service libvirtd start
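
The block to remove, inside the editor, looks something like this (the actual address and prefix will vary):

<ip family='ipv6' address='fd00:dead:beef:55::1' prefix='64'>
</ip>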

Base image build

Then we can build an image using virt-builder:

virt-builder debian-9 --size=10G --format qcow2 \
  -o /var/lib/libvirt/images/stretch-amd64.qcow2 \
  --update \
  --firstboot-command "dpkg-reconfigure openssh-server" \
  --network --edit /etc/network/interfaces:s/ens2/ens3/ \
  --ssh-inject root:string:'ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC7CY6+aTLlk6epl1+TK6wIaHg1fageEfmKFgn+Yov+2lKFIhNRkcWznQVcyViVmC7iaZkEIei1gP9+0lrsdhewtTBjvkDNxR18aIORJsiH95FFjFIuJ0HQjrM1jOxiXhQZ0xLlnhFkxxa8j9l52HTutpYUU63e3lvY0CBuqh7QtkH3un7iT6EaqMR34yFa2ym35ag8ugMbczBwnTDJYn3qpL8gKuw3JnIp+qdSQb1sGdLcC4JN02E2/IY7iw8lzM9xVab1IgvemCJwS0C/Bt9LsmhCy9AMpaVFaAYjepgdBpSqIMa/8VcoVOrhdJWfIc7fLtt+njN1qojsPmuhsr1n' \
  --hostname stretch-amd64 --timezone UTC

This is not ideal, as it fetches the base image from libguestfs.org in the clear (as opposed to from debian.org infrastructure):

[   1.9] Downloading: http://libguestfs.org/download/builder/debian-9.xz

There is, fortunately, an OpenPGP signature on those images, but it might be better to bootstrap using debootstrap (although bootstrapping using the above might be much faster).

Also notice how we edit the interfaces file to fix the interface name. For some reason, the interface detected by virt-builder isn't the same one that shows up when running with virt-install, below. The symlink trick does not work: adding --link /dev/null:/etc/systemd/network/99-default.link to the virt-builder incantation does not disable those funky interface names. So we simply rewrite the file.

Finally, we inject our SSH key into the root account. The build process will print a root password, but thanks to that key we won't need it.

If the build fails with this error:

[ 156.9] Resizing (using virt-resize) to expand the disk to 10.0G
virt-resize: error: libguestfs error: /usr/bin/supermin exited with error 
status 1.

It might be that you ran out of space in /var/tmp. You can use TMPDIR to switch to a larger directory.
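
For example, assuming /srv/tmp lives on a partition with enough room ([...] standing for the rest of the incantation above):

mkdir -p /srv/tmp
TMPDIR=/srv/tmp virt-builder debian-9 [...]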

Autopkg builders

Images can also be built with autopkgtest, which itself delegates the job to vmdb2, with something like:

sudo autopkgtest-build-qemu stable /var/lib/libvirt/images/debian9-amd64-autopkgtest.qcow2

There are obviously many, many more ways of building such images; these are just the ones I found the most practical.

Official images

An alternative way of getting a base image is to just download images from https://cloud.debian.org/. They have minimal QCOW2 images that can serve as a template for multiple VMs. For example, this downloads the latest bookworm daily build (an arm64 one, in this case):

cd /var/lib/libvirt/images
curl -L -O https://cloud.debian.org/images/cloud/bookworm/daily/latest/debian-12-nocloud-arm64-daily.qcow2

The above is a really bare image. You might want cloud-init to make your life easier:

curl -L -O https://cloud.debian.org/images/cloud/bookworm/daily/latest/debian-12-genericcloud-arm64-daily.qcow2
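
The genericcloud image boots with cloud-init enabled but unconfigured; one way to feed it settings is a seed ISO built with cloud-localds, from the cloud-image-utils package. A minimal sketch, with a hypothetical user-data file that just injects an SSH key:

cat > user-data <<EOF
#cloud-config
ssh_authorized_keys:
  - ssh-rsa AAAA[...]
EOF
cloud-localds seed.iso user-data

The resulting seed.iso can then be attached to the machine as a CD-ROM, for example with an extra --disk path=seed.iso,device=cdrom argument to the virt-install command below.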

Virtual machine creation

Then the virtual machine can be created and started with:

virt-install --import --noautoconsole \
  --memory 1024 \
  --name debian-12-amd64-test \
  --disk path=/var/lib/libvirt/images/debian-12-amd64-test.img 

The path argument can be simplified by using existing volume pools, which can be listed with:

# virsh pool-list
 Name                 State      Autostart 
-------------------------------------------
 boot-scratch         active     yes
 default              active     yes

[[!tip """Notice how the virsh command is called as root. That's not absolutely necessary, but by default, when called as a user, it will connect to the user-specific session (qemu:///session) instead of the system-level one (qemu:///system). This can be worked around by using the --connect qemu:///system argument or by changing the default URI."""]]

The actual path of the volume pool can be found with:

# virsh pool-dumpxml default | grep path
<path>/var/lib/libvirt/images</path>

Then a machine can be created in the pool with the --disk vol=default/debian-12-amd64-test.qcow2 argument.

Note that the virtual machine will write directly to the qcow image file. To work on a disposable overlay instead, you can create one with:

cd /var/lib/libvirt/images/
qemu-img create -f qcow2 -o backing_file=debian-12-nocloud-amd64-daily.qcow2,backing_fmt=qcow2 debian-12-amd64-test.img 10G
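
The resulting backing chain can be verified with:

qemu-img info --backing-chain debian-12-amd64-test.img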

This guide previously suggested the following command, which doesn't seem to work anymore (to be tested):

qemu-img create -f qcow2 -b debian9-amd64-autopkgtest.qcow2 overlay.img

IP address discovery

The VM will be created with an IP address allocated by the DHCP server. The latter's logs (or tcpdump -n -i any -s 1500 '(port 67 or port 68)') will show the IP address; otherwise, the root password will be necessary to log in on the console and find it.
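
On the default NAT network, recent libvirt versions can also query the DHCP lease database directly, which avoids digging through logs:

virsh domifaddr stretch-amd64

For bridged setups, the --source arp or --source agent variants (the latter requires qemu-guest-agent in the guest) may work instead.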

Alternatively, the IPv6 address of the guest can be deduced from the IP address of the host's vnet0 interface. For example, here's the interface as viewed from the host:

45: vnet0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master br0 state UNKNOWN group default qlen 1000
    link/ether fe:54:00:1e:c2:48 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fc54:ff:fe1e:c248/64 scope link 
       valid_lft forever preferred_lft forever

And from the guest:

2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:1e:c2:48 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.216/24 brd 192.168.0.255 scope global ens3
       valid_lft forever preferred_lft forever
    inet6 fd05:5f2d:569f:0:5054:ff:fe1e:c248/64 scope global mngtmpaddr dynamic 
       valid_lft 7054sec preferred_lft 1654sec
    inet6 2607:f2c0:f00f:8f00:5054:ff:fe1e:c248/64 scope global mngtmpaddr dynamic 
       valid_lft 7054sec preferred_lft 1654sec
    inet6 fe80::5054:ff:fe1e:c248/64 scope link 
       valid_lft forever preferred_lft forever

Notice how the MAC addresses are almost identical? Only the prefix differs: fe on the host and 52 on the guest. This can be used to guess the IPv6 address of the guest in order to administer the machine. The local segment IPv6 multicast address (ff02::1) can be used to confirm the address:

# ping6 -I br0 ff02::1
ping6: Warning: source address might be selected on device other than br0.
PING ff02::1(ff02::1) from :: br0: 56 data bytes
[...]
64 bytes from fe80::5054:ff:fe1e:c248%br0: icmp_seq=1 ttl=64 time=0.281 ms (DUP!)
[...]
^C
--- ff02::1 ping statistics ---
1 packets transmitted, 1 received, +4 duplicates, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.049/0.339/0.515/0.166 ms

That latter MAC address is also known to libvirt, so this command will show the right MAC:

# virsh domiflist stretch-amd64
Interface  Type       Source     Model       MAC
-------------------------------------------------------
vnet0      bridge     br0        virtio      52:54:00:55:44:73

And obviously, connecting to the console and running ip a will show the right IP address; see below for console usage.

Note that netfilter might be firewalling the bridge. To disable, use:

sysctl net.bridge.bridge-nf-call-ip6tables=0
sysctl net.bridge.bridge-nf-call-iptables=0
sysctl net.bridge.bridge-nf-call-arptables=0

See also the sbuild / qemu blog post for details on how to integrate sbuild images with libvirt.

Maintenance

List running VMs:

virsh list
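
That only shows running VMs; to also list the ones that are shut off:

virsh list --all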

To start a VM:

virsh start stretch-amd64

Get a console:

virsh console stretch-amd64

To stop a VM:

virsh shutdown stretch-amd64

To kill a VM that's hung:

virsh destroy stretch-amd64

To reinstall a VM, the machine needs to be stopped (above) and its definition removed so the name can be reclaimed (source):

virsh undefine stretch-amd64
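
To also delete the associated disk images in the same operation, this variant should work:

virsh undefine stretch-amd64 --remove-all-storage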

Connecting to a remote libvirt instance

Assuming that (a) your user can run commands like virsh list on the remote host and (b) you can access that user over SSH, you can actually manage a remote libvirt server with virt-manager or virsh using remote URIs. For example, this will connect to the remote libvirt machine using virsh:

virsh -c qemu+ssh://user@example.com/system list

A similar URL can be used in virt-manager, which allows you to connect to the remote console easily, for example. Pretty neat.
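
To avoid retyping the URI on every call, it can be set as the default through the environment:

export LIBVIRT_DEFAULT_URI=qemu+ssh://user@example.com/system
virsh list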

Remaining tasks

References

Container notes

Those are notes and reminders of how to do "things" with containers, regardless of technology. They are not a replacement for the official documentation and may only be useful to myself.

Docker

To build an image:

docker build --tag foo .

That will create an image named "foo" from the Dockerfile in the current directory (even though it says --tag, that's actually the image name, whatever).

To enter a container:

docker run --tty --interactive foo /bin/bash

To map volumes into containers (some images pre-define certain VOLUMEs), first create a volume:

docker volume create foo

Then use it in the container:

docker run --tty --interactive --volume foo:/srv/foo foo /bin/bash

Volumes are basically directories stored under /var/lib/docker/volumes which can be copied around normally.
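
The exact path of a given volume can be confirmed with:

docker volume inspect --format '{{ .Mountpoint }}' foo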

To restart a container on reboot, use --restart=unless-stopped or --restart=always, as documented.
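
The restart policy of an existing container can also be changed in place, without recreating it:

docker update --restart=unless-stopped foo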

Restarting containers

A common problem I have is that I forget how I started a given container. When it's stopped, crashed or upgraded, I don't know how to restart it with the same arguments. There's docker inspect, which will show the arguments passed to the container, but not, directly, flags like environment variables or mountpoints. Those can be deduced from the JSON output, but it's unclear what's a default and what was actually specified by hand.

For this, the runlike tool is useful:

# docker run --rm -v /var/run/docker.sock:/var/run/docker.sock assaflavie/runlike grafana
docker run --name=grafana --hostname=dd2130c9306c --user=grafana --env="GF_METRICS_ENABLED=true" --env="GF_ANALYTICS_REPORTING_ENABLED=false" --env="GF_USERS_ALLOW_SIGN_UP=false" --env="GF_ALTERTING_ENABLED=false" --env="PATH=/usr/share/grafana/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" --env="GF_PATHS_CONFIG=/etc/grafana/grafana.ini" --env="GF_PATHS_DATA=/var/lib/grafana" --env="GF_PATHS_HOME=/usr/share/grafana" --env="GF_PATHS_LOGS=/var/log/grafana" --env="GF_PATHS_PLUGINS=/var/lib/grafana/plugins" --env="GF_PATHS_PROVISIONING=/etc/grafana/provisioning" --volume="grafana-storage:/var/lib/grafana" -p 3000:3000 --restart=unless-stopped --detach=true grafana/grafana

It may be a little verbose, but it's a good basis to restart a container. The correct incantation turns out to be:

docker run --name=grafana --user=grafana --env="GF_METRICS_ENABLED=true" --env="GF_ANALYTICS_REPORTING_ENABLED=false" --env="GF_USERS_ALLOW_SIGN_UP=false" --env="GF_ALTERTING_ENABLED=false" --volume="grafana-storage:/var/lib/grafana" -p 3000:3000 --restart=unless-stopped grafana/grafana

For now I'm storing the canonical commandline in a "start-$image" script (e.g. start-airsonic, start-grafana) but that seems suboptimal.

Rocket

Running docker containers:

$ sudo rkt run --insecure-options=image --interactive docker://busybox -- /bin/sh

Those get resolved using rkt's image resolution mechanism.

Re-running:

$ sudo rkt run registry-1.docker.io/library/debian:latest --interactive --exec /bin/bash --net=host

Building images requires the separate acbuild command, which builds "standard" ACI images, not Docker images. Other tools are available, like Packer, umoci or Buildah, although only Buildah can use Dockerfiles to build images.
