hosting
Notes about virtual machine and container hosting.
This overlaps with my work on sbuild-qemu, which has its own way of provisioning virtual machines.
KVM bootstrap with libvirt
I got tired of dealing with VirtualBox and Vagrant: those tools work well, but they are too far from datacenter-level hosting primitives, which right now converge towards KVM (or maybe Xen, but that doesn't seem to have recovered from the Meltdown attacks). VirtualBox was also not shipped in stretch because "upstream doesn't play in a really fair mode wrt CVEs" and simply ships updates in bulk.
So I started looking into KVM. It seems a common way to get started with this without setting up a whole cluster management system (e.g. Ganeti) is to use libvirt. The instructions here also include bridge setup information for Debian stretch, since a bridge makes it easier to host services inside the virtual machines than a clunky NAT setup.
Bridge configuration
Assuming the local Ethernet interface is called eno1, the following configuration, in /etc/network/interfaces.d/br0, enables a bridge on the host:
iface eno1 inet manual

auto br0
iface br0 inet static
    # really necessary?
    #hwaddress ether f4:4d:30:66:14:9a
    address 192.168.0.7
    netmask 255.255.255.0
    gateway 192.168.0.1
    dns-nameservers 8.8.8.8
    bridge_ports eno1

iface br0 inet6 auto
Then disable other networking interfaces and enable the bridge:
ifdown eno1
service NetworkManager restart
ifup br0
Finally, by default Linux bridges disable forwarding through the firewall. This works independently of the net.ipv[46].conf.all.forwarding setting, which should stay turned off unless we actually want to route packets for the network (as opposed to the guests). This can be tweaked by talking with iptables directly:
iptables -I FORWARD -m physdev --physdev-is-bridged -j ACCEPT
Or, preferably, by disabling the firewall on the bridge completely. This can be done by adding this to /etc/sysctl.d/br0-nf-disable.conf:
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-arptables = 0
This was discovered in the libvirt wiki.
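Those settings should then be loadable without a reboot; assuming the br_netfilter module is already loaded, something like this should do it:

sysctl -p /etc/sysctl.d/br0-nf-disable.conf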
NAT configuration
The default configuration in libvirtd is a "NAT" configuration. That, in turn, injects firewall rules in the kernel when the "network" is started, to rewrite packets going in and out of the VM. dnsmasq is used for DNS and DHCP as well.
I had quite a battle with this network on my laptop, angela. At first nothing was getting through: IPv6 SLAAC configuration was working, but not DHCP. This hung the VM at boot, which led me to switch to systemd-networkd (see boot time optimizations). That didn't fix networking, but boot would at least not hang for a full minute while DHCP failed.
Then the fix was to add a subset of the Puppet module's NFT ruleset, through this commit:
class { 'nftables::rules::qemu':
masquerade => false,
}
That resulted in the following patch to the ruleset:
--- /etc/nftables/puppet/inet-filter-chain-default_in.nft 2023-11-28 15:47:58.143874297 -0500
+++ /tmp/puppet-file20231128-15717-utsqvt 2023-11-28 15:59:57.891321815 -0500
@@ -6,6 +6,12 @@
ip6 nexthdr ipv6-icmp accept
# Start of fragment order:50 rulename:default_in-avahi_udp
ip saddr { 0.0.0.0/0 } udp dport 5353 accept
+# Start of fragment order:50 rulename:default_in-qemu_dhcpv4
+ iifname "virbr0" meta l4proto udp udp dport 67 accept
+# Start of fragment order:50 rulename:default_in-qemu_tcp_dns
+ iifname "virbr0" tcp dport 53 accept
+# Start of fragment order:50 rulename:default_in-qemu_udp_dns
+ iifname "virbr0" udp dport 53 accept
# Start of fragment order:50 rulename:default_in-ssh
tcp dport {22} accept
# Start of fragment order:50 rulename:default_in-syncthing
--- /etc/nftables/puppet/inet-filter-chain-default_fwd.nft 2023-11-28 15:47:58.151874290 -0500
+++ /tmp/puppet-file20231128-15717-rv4jlv 2023-11-28 15:59:57.903321806 -0500
@@ -1,4 +1,10 @@
# Start of fragment order:00 default_fwd header
chain default_fwd {
+# Start of fragment order:50 rulename:default_fwd-qemu_iip_v4
+ iifname "virbr0" ip saddr 192.168.122.0/24 accept
+# Start of fragment order:50 rulename:default_fwd-qemu_io_internal
+ iifname "virbr0" oifname "virbr0" accept
+# Start of fragment order:50 rulename:default_fwd-qemu_oip_v4
+ oifname "virbr0" ip daddr 192.168.122.0/24 ct state related,established accept
# Start of fragment order:99 default_fwd footer
}
Note that the network range matters here: it needs to match the one visible in the output of:
virsh net-dumpxml default
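For reference, on a stock install, the relevant part of that XML looks something like this (the exact addresses may vary):

<ip address='192.168.122.1' netmask='255.255.255.0'>
  <dhcp>
    <range start='192.168.122.2' end='192.168.122.254'/>
  </dhcp>
</ip>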
Also note that I previously included the nftables::rules::qemu class as is, but that broke libvirtd networking with this error:
error: internal error: Failed to apply firewall rules /usr/sbin/iptables -w --table nat --list-rules: # Warning: iptables-legacy tables present, use iptables-legacy to see them
The solution was to do the above masquerade => false. Or, in a diff:
--- /etc/nftables/puppet/ip-nat-chain-POSTROUTING.nft 2023-11-28 14:55:32.881506364 -0500
+++ /tmp/puppet-file20231128-9849-fc3war 2023-11-28 15:47:58.163874281 -0500
@@ -4,15 +4,5 @@
type nat hook postrouting priority 100
# Start of fragment order:02 rulename:POSTROUTING-policy
policy accept
-# Start of fragment order:50 rulename:POSTROUTING-qemu_ignore_broadcast
- ip saddr 192.168.122.0/24 ip daddr 255.255.255.255 return
-# Start of fragment order:50 rulename:POSTROUTING-qemu_ignore_multicast
- ip saddr 192.168.122.0/24 ip daddr 224.0.0.0/24 return
-# Start of fragment order:50 rulename:POSTROUTING-qemu_masq_ip
- ip saddr 192.168.122.0/24 ip daddr != 192.168.122.0/24 masquerade
-# Start of fragment order:50 rulename:POSTROUTING-qemu_masq_tcp
- meta l4proto tcp ip saddr 192.168.122.0/24 ip daddr != 192.168.122.0/24 masquerade to :1024-65535
-# Start of fragment order:50 rulename:POSTROUTING-qemu_masq_udp
- meta l4proto udp ip saddr 192.168.122.0/24 ip daddr != 192.168.122.0/24 masquerade to :1024-65535
# Start of fragment order:99 POSTROUTING footer
}
The DNS server distributed by dnsmasq also doesn't seem quite correct, as the guest is trying to reach 10.0.2.3 for some reason. I had to do this for DNS to work:
echo nameserver 192.168.122.1 > /etc/resolv.conf
This might be solved by hardcoding a DNS server in systemd-networkd or elsewhere.
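For example, a systemd-networkd unit could pin the DNS server, as a sketch (the file name is hypothetical, and the interface name assumes the ens3 guest interface seen below):

# /etc/systemd/network/50-dhcp.network
[Match]
Name=ens3

[Network]
DHCP=yes
DNS=192.168.122.1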
IPv6 is configured by default, so if you're on an IPv4-only network, some problems are likely to occur. The fix is to edit the network and remove the <ip> block for IPv6:
service libvirtd stop
virsh net-destroy default
virsh net-edit default
virsh net-start default
service libvirtd start
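For reference, the IPv6 block to remove in the editor looks something like this (the address here is just a placeholder):

<ip family='ipv6' address='2001:db8::1' prefix='64'>
</ip>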
Base image build
Then we can build an image using virt-builder:
virt-builder debian-9 --size=10G --format qcow2 \
-o /var/lib/libvirt/images/stretch-amd64.qcow2 \
--update \
--firstboot-command "dpkg-reconfigure openssh-server" \
--network --edit /etc/network/interfaces:s/ens2/ens3/ \
--ssh-inject root:string:'ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC7CY6+aTLlk6epl1+TK6wIaHg1fageEfmKFgn+Yov+2lKFIhNRkcWznQVcyViVmC7iaZkEIei1gP9+0lrsdhewtTBjvkDNxR18aIORJsiH95FFjFIuJ0HQjrM1jOxiXhQZ0xLlnhFkxxa8j9l52HTutpYUU63e3lvY0CBuqh7QtkH3un7iT6EaqMR34yFa2ym35ag8ugMbczBwnTDJYn3qpL8gKuw3JnIp+qdSQb1sGdLcC4JN02E2/IY7iw8lzM9xVab1IgvemCJwS0C/Bt9LsmhCy9AMpaVFaAYjepgdBpSqIMa/8VcoVOrhdJWfIc7fLtt+njN1qojsPmuhsr1n' \
--hostname stretch-amd64 --timezone UTC
This is not ideal, as it fetches the base image from libguestfs.org, in the clear (as opposed to debian.org infrastructure):
[ 1.9] Downloading: http://libguestfs.org/download/builder/debian-9.xz
There is, fortunately, an OpenPGP signature on those images, but it might be better to bootstrap using debootstrap (although bootstrapping using the above might be much faster).
Also notice how we edit the interfaces file to fix the interface name. For some reason, the interface detected by virt-builder isn't the same one that shows up when running with virt-install, below. The symlink trick does not work: adding --link /dev/null:/etc/systemd/network/99-default.link to the virt-builder incantation does not disable those funky interface names. So we simply rewrite the file.
Finally, we inject our SSH key in the root account. The build process will show a root password but we won't need it thanks to that.
If the build fails with this error:
[ 156.9] Resizing (using virt-resize) to expand the disk to 10.0G
virt-resize: error: libguestfs error: /usr/bin/supermin exited with error
status 1.
It might be that you ran out of space in /var/tmp. You can use TMPDIR to switch to a larger directory.
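For example, assuming /srv/tmp sits on a larger filesystem (the path is hypothetical):

mkdir -p /srv/tmp
TMPDIR=/srv/tmp virt-builder debian-9 --size=10G ...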
Autopkg builders
Images can also be built with autopkgtest, which itself delegates the job to vmdb2, with something like:
sudo autopkgtest-build-qemu stable /var/lib/libvirt/images/debian9-amd64-autopkgtest.qcow2
There are obviously many, many more options for building such images; those are just the ones I found the most practical.
Official images
An alternative way of getting a base image is to just download images from https://cloud.debian.org/. They have QCOW2 images that are minimal and can serve as a template for multiple VMs. For example, this downloads the latest bookworm build:
cd /var/lib/libvirt/images
curl -L -O https://cloud.debian.org/images/cloud/bookworm/daily/latest/debian-12-nocloud-arm64-daily.qcow2
The above is a really bare image. You might want cloud-init to make your life easier:
curl -L -O https://cloud.debian.org/images/cloud/bookworm/daily/latest/debian-12-genericcloud-arm64-daily.qcow2
Virtual machine creation
Then the virtual machine can be created and started with:
virt-install --import --noautoconsole \
--memory 1024 \
--name debian-12-amd64-test \
--disk path=/var/lib/libvirt/images/debian-12-amd64-test.img
The path argument can be simplified by using existing volume pools, which can be listed with:
# virsh pool-list
Name State Autostart
-------------------------------------------
boot-scratch active yes
default active yes
[[!tip """Notice how the virsh command is called as root. That's not
absolutely necessary, but by default when called as a user, it will
connect to the user-specific session (qemu:///session
) instead of
the system-level one (qemu:///system
). This can be worked around by
using the --connect qemu:///system
argument or by changing the
default URI.
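One way to change that default URI is through the environment, for example in a shell profile:

export LIBVIRT_DEFAULT_URI=qemu:///system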
The actual path of the volume pool can be found with:
# virsh pool-dumpxml default | grep path
<path>/var/lib/libvirt/images</path>
Then a machine can be created in the pool with the --disk vol=default/debian-12-amd64-test.qcow2 argument.
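That is, something like this should be equivalent to the earlier invocation:

virt-install --import --noautoconsole \
    --memory 1024 \
    --name debian-12-amd64-test \
    --disk vol=default/debian-12-amd64-test.qcow2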
Note that the virtual machine will directly write to the qcow2 image file. To work on a temporary file, you can create one with:
cd /var/lib/libvirt/images/
qemu-img create -f qcow2 -o backing_file=debian-12-nocloud-amd64-daily.qcow2,backing_fmt=qcow2 debian-12-amd64-test.img 10G
This guide previously suggested the following command for this, but it doesn't seem to work anymore:
qemu-img create -f qcow2 -b debian9-amd64-autopkgtest.qcow2 overlay.img
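A plausible explanation is that newer qemu-img versions refuse to guess the backing file format, so explicitly specifying it with -F might fix it (untested):

qemu-img create -f qcow2 -b debian9-amd64-autopkgtest.qcow2 -F qcow2 overlay.img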
IP address discovery
The VM will be created with an IP address allocated by the DHCP server. The DHCP server's logs (or tcpdump -n -i any -s 1500 '(port 67 or port 68)') will show the IP address; otherwise the root password will be necessary to discover it.
Alternatively, the IPv6 address of the guest can be deduced from the IP address of the host's vnet0 interface. For example, here's the interface as viewed from the host:
45: vnet0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master br0 state UNKNOWN group default qlen 1000
link/ether fe:54:00:1e:c2:48 brd ff:ff:ff:ff:ff:ff
inet6 fe80::fc54:ff:fe1e:c248/64 scope link
valid_lft forever preferred_lft forever
And from the guest:
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 52:54:00:1e:c2:48 brd ff:ff:ff:ff:ff:ff
inet 192.168.0.216/24 brd 192.168.0.255 scope global ens3
valid_lft forever preferred_lft forever
inet6 fd05:5f2d:569f:0:5054:ff:fe1e:c248/64 scope global mngtmpaddr dynamic
valid_lft 7054sec preferred_lft 1654sec
inet6 2607:f2c0:f00f:8f00:5054:ff:fe1e:c248/64 scope global mngtmpaddr dynamic
valid_lft 7054sec preferred_lft 1654sec
inet6 fe80::5054:ff:fe1e:c248/64 scope link
valid_lft forever preferred_lft forever
Notice how the MAC addresses are almost identical? Only the prefix differs: fe on the host and 52 on the guest. This might be used to guess the IPv6 address of the guest to administer the machine. The local segment IPv6 multicast address (ff02::1) can be used to confirm the IP address:
# ping6 -I br0 ff02::1
ping6: Warning: source address might be selected on device other than br0.
PING ff02::1(ff02::1) from :: br0: 56 data bytes
[...]
64 bytes from fe80::5054:ff:fe1e:c248%br0: icmp_seq=1 ttl=64 time=0.281 ms (DUP!)
[...]
^C
--- ff02::1 ping statistics ---
1 packets transmitted, 1 received, +4 duplicates, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.049/0.339/0.515/0.166 ms
The latter MAC address is also known by libvirt, so this command will show the right MAC:
# virsh domiflist stretch-amd64
Interface Type Source Model MAC
-------------------------------------------------------
vnet0 bridge br0 virtio 52:54:00:55:44:73
And obviously, connecting to the console and running ip a will show the right IP address; see below for console usage.
Note that netfilter might be firewalling the bridge. To disable, use:
sysctl net.bridge.bridge-nf-call-ip6tables=0
sysctl net.bridge.bridge-nf-call-iptables=0
sysctl net.bridge.bridge-nf-call-arptables=0
See also the sbuild / qemu blog post for details on how to integrate sbuild images with libvirt.
Maintenance
List running VMs:
virsh list
To start a VM:
virsh start stretch-amd64
Get a console:
virsh console stretch-amd64
To stop a VM:
virsh shutdown stretch-amd64
To kill a VM that's hung:
virsh destroy stretch-amd64
To reinstall a VM, the machine needs to be stopped (above) and the namespace reclaimed (source):
virsh undefine stretch-amd64
Connecting to a remote libvirt instance
Assuming that (a) your user can run commands like virsh list and (b) you can access that user using SSH, you can actually manage a remote libvirt server with virt-manager or virsh using remote URIs. For example, this will connect to the remote libvirt machine using virsh:
virsh -c qemu+ssh://user@example.com/system list
A similar URL can be used in virt-manager, which allows you to connect to the remote console easily, for example. Pretty neat.
Remaining tasks
- /etc/default/libvirt-guests defines how guests are started
- virsh autostart can enable automatic restarts, remains to be tested
- virsh domifaddr should normally show the IP address of the guest, but it's possible this does not work in bridge mode
- disk images like qcow2 might be too slow for production use, we should use LVM instead
- maybe consider virt-lightning to improve startup times
References
- libvirt handbook bridge configuration
- libvirt wiki networking configuration
- a good libvirt networking handbook
- Arch Linux wiki page
- Debian wiki KVM reference - also includes tuning options for disks, CPU, I/O
- nixCraft guide - which gave me the virt-builder shortcut (instead of installing Debian from scratch using an ISO!)
- the virsh manual page is excellent
Container notes
Those are notes and reminders of how to do "things" with containers, regardless of technology. They are not a replacement for the official documentation and may only be useful for myself.
Docker
To build an image:
docker build --tag foo .
That will create an image named "foo" (even if it says --tag, that's actually the image name, whatever).
To enter a container:
docker run --tty --interactive foo /bin/bash
To map volumes into containers (images may pre-define certain VOLUMEs), first create a volume:
docker volume create foo
Then use it in the container:
docker run --tty --interactive --volume foo:/srv/foo foo /bin/bash
Volumes are basically directories stored in /var/lib/docker/volumes which can be copied around normally.
To restart a container on reboot, use --restart=unless-stopped or --restart=always, as documented.
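The restart policy of an already-created container can also be changed after the fact, for example:

docker update --restart=unless-stopped grafana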
Restarting containers
A common problem I have is that I forget how I started a given container. When it's stopped, crashed, or upgraded, I don't know how to restart it with the same arguments. There's docker inspect, which will tell me the arguments passed to the container, but not flags like environment variables or mountpoints. Those can be deduced from the JSON output, but it's unclear what's a default and what was actually specified by hand.
For this, the runlike tool is useful:
# docker run --rm -v /var/run/docker.sock:/var/run/docker.sock assaflavie/runlike grafana
docker run --name=grafana --hostname=dd2130c9306c --user=grafana --env="GF_METRICS_ENABLED=true" --env="GF_ANALYTICS_REPORTING_ENABLED=false" --env="GF_USERS_ALLOW_SIGN_UP=false" --env="GF_ALTERTING_ENABLED=false" --env="PATH=/usr/share/grafana/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" --env="GF_PATHS_CONFIG=/etc/grafana/grafana.ini" --env="GF_PATHS_DATA=/var/lib/grafana" --env="GF_PATHS_HOME=/usr/share/grafana" --env="GF_PATHS_LOGS=/var/log/grafana" --env="GF_PATHS_PLUGINS=/var/lib/grafana/plugins" --env="GF_PATHS_PROVISIONING=/etc/grafana/provisioning" --volume="grafana-storage:/var/lib/grafana" -p 3000:3000 --restart=unless-stopped --detach=true grafana/grafana
It may be a little verbose, but it's a good basis to restart a container. The correct incantation turns out to be:
docker run --name=grafana --user=grafana --env="GF_METRICS_ENABLED=true" --env="GF_ANALYTICS_REPORTING_ENABLED=false" --env="GF_USERS_ALLOW_SIGN_UP=false" --env="GF_ALTERTING_ENABLED=false" --volume="grafana-storage:/var/lib/grafana" -p 3000:3000 --restart=unless-stopped grafana/grafana
For now I'm storing the canonical commandline in a "start-$image" script (e.g. start-airsonic, start-grafana), but that seems suboptimal.
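Such a script is just a thin wrapper around the canonical command line found above, something like this sketch (trimmed to a few of the flags from the runlike output):

#!/bin/sh
# start-grafana: recreate the grafana container with its canonical arguments
exec docker run --name=grafana --user=grafana \
    --env="GF_METRICS_ENABLED=true" \
    --volume="grafana-storage:/var/lib/grafana" \
    -p 3000:3000 --restart=unless-stopped --detach=true \
    grafana/grafana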
Rocket
Running docker containers:
$ sudo rkt run --insecure-options=image --interactive docker://busybox -- /bin/sh
Those get resolved using rkt's image resolution mechanism.
Re-running:
$ sudo rkt run registry-1.docker.io/library/debian:latest --interactive --exec /bin/bash --net=host
Building images requires using the separate acbuild command which builds "standard" ACI images and not docker images. Other tools are available like Packer, umoci or Buildah, although only Buildah can use Dockerfiles to build images.