Demystifying Proxmox VE: ZFS Replication, Failover, and Cluster Behavior

This post is the first in a multi-part series, ‘Demystifying Proxmox VE’, capturing & documenting my learnings & experiences as I tinker with this incredible open-source virtualization platform – Proxmox VE.

What Is Proxmox VE — and Why Everyone’s Talking about It

Proxmox Virtual Environment (Proxmox VE) is a powerful, open-source virtualization platform that combines KVM (Kernel-based Virtual Machine) and LXC (Linux Containers) into one integrated management stack. Out of the box, it also provides software-defined storage (ZFS, Ceph, LVM), built-in backups, and high-availability clustering — all accessible through a modern web interface or CLI.

Think of it as a complete data-center-in-a-box that runs equally well on enterprise servers and DIY homelab gear. We can spin up Windows and Linux VMs, create lightweight containers, manage networks and firewalls, replicate storage, and even orchestrate full clusters — all without licensing fees.

    Proxmox has become a staple in the homelab community, thanks to its combination of:

    • A full-featured web GUI
    • Integrated storage and backups
    • Easy clustering

    For SMBs that can’t justify the cost or complexity of VMware vSphere, Proxmox offers enterprise-grade features without enterprise-grade licensing. Backup scheduling, live migration, and high availability all come standard. There’s even a paid support subscription for teams that need commercial assurance.

    Enterprises with mixed workloads or open-source strategies increasingly view Proxmox as a way to break away from vendor lock-in. With open APIs, REST endpoints, and full transparency, Proxmox fits well into hybrid or edge deployments — especially those driven by Linux or Kubernetes.

    When Broadcom acquired VMware in 2023, licensing models shifted dramatically:

    • Higher costs for perpetual licenses and renewals,
    • Discontinued products (like vSphere Essentials and free ESXi),
    • Restrictive subscriptions requiring minimum bundles.

    This left many IT pros searching for an affordable alternative and opened the door for options like Proxmox, Nutanix AHV, XCP-ng, and Microsoft Hyper-V.

    DIY’ers & homelab enthusiasts had already been using Proxmox for tinkering (since it is open-source & free), which helped build a robust & incredibly collaborative community. The Broadcom-VMware development brought it to the forefront & made it one of the strongest contenders.

    Overnight, thousands of organizations — from universities to managed service providers — started evaluating or migrating to Proxmox VE. Its open-source nature means:

    • No licensing audits,
    • No feature gating,
    • And long-term stability independent of corporate mergers.

    For many, Proxmox became the natural evolution path from VMware ESXi, especially for labs, edge servers, and SMB data centers.

    Beyond its flagship Proxmox Virtual Environment (VE) hypervisor, the Proxmox team offers a tightly integrated suite of open-source infrastructure tools that complement each other seamlessly.

    • Proxmox Backup Server (PBS) provides enterprise-grade, deduplicated backups for VMs, containers, and physical hosts with incremental snapshots and strong data integrity verification—perfect for both homelabs and production environments.
    • Proxmox Mail Gateway (PMG) acts as a full-featured email security solution, filtering spam, malware, and phishing before messages reach your internal mail servers.

    One can download the ISO files for the offerings here – https://www.proxmox.com/en/downloads

    Proxmox VE comprises a few key software components that make clustering, synchronization, and high availability (HA) possible.
    Below are the major building blocks of a Proxmox VE cluster — what they do, and how they fit together.

    Corosync is the cluster communication engine used by Proxmox VE.
    It provides the messaging layer that allows all nodes in a cluster to exchange heartbeat signals, membership data, and quorum votes.

    In short, Corosync is the heartbeat of the Proxmox cluster — if it fails, the cluster loses coordination.

    Quorum is the rule that determines whether the cluster is allowed to make changes.
    It ensures that a majority of nodes agree on the current state before performing any actions like starting or migrating VMs.

    This inbuilt mechanism prevents “split-brain”, where two isolated groups of nodes both think they’re the primary cluster, and it avoids data corruption by ensuring only one majority partition stays active.

    Basic Formula

    Quorum = floor(Number of Nodes / 2) + 1

    Example:

    3 nodes → quorum = 2

    5 nodes → quorum = 3

    If the number of online nodes drops below quorum, the cluster goes into read-only mode (no VM starts, migrations, or HA actions).

    You might ask – “Those examples start at three nodes. What happens when we are dealing with a 2-node cluster?”

    Enter ‘QDevice’.

    A QDevice is a tie-breaker node that helps small clusters (usually 2-node setups) maintain quorum.
    It participates only in quorum voting — it doesn’t run VMs or hold storage.

    Example:
    In a 2-node cluster, both nodes must be online to keep quorum. If one goes down, the other loses quorum and stops making changes.
    Adding a QDevice gives you a third vote, so the cluster can maintain quorum with one node offline.

    Common choices for QDevice:

    • A lightweight VM on another host
    • A Raspberry Pi

    Basically, this can be any small Linux box running the ‘corosync-qnetd’ service (the cluster nodes themselves run the companion ‘corosync-qdevice’ service).

    pmxcfs is a distributed, in-memory file system built on SQLite and synchronized over Corosync.
    It stores all cluster-wide configuration files — things like VM definitions, storage configuration, HA groups, and user permissions.

    Key Points

    • Lives under /etc/pve/
    • Automatically synchronized across all nodes in the cluster
    • Read-only when quorum is lost (to prevent config corruption)

    The pve-cluster package ties Corosync and pmxcfs together.
    It’s the service layer that initializes cluster membership, mounts pmxcfs, and handles node joins and departures.

    It does things like:

    • Mount /etc/pve/ as a shared configuration directory
    • Manage node identity and cluster keys
    • Provide the backend for cluster commands (pvecm)

    These are the High Availability (HA) management daemons.
    They monitor nodes and automatically move or restart VMs and containers when failures occur.

    • pve-ha-manager – Top-level service controlling HA policy.
    • pve-ha-crm (Cluster Resource Manager) – Decides where to place resources.
    • pve-ha-lrm (Local Resource Manager) – Executes start/stop commands on each node.

    They depend on Corosync and Quorum being healthy. Without quorum, HA actions are suspended.

    These services form the management and monitoring layer:

    • pvedaemon handles API and task execution.
    • pvestatd collects stats (CPU, memory, disk, etc.) from all nodes.
    • pveproxy runs the web GUI and REST API interface.

    They’re not cluster-critical but provide the control and visibility for administrators.

    While not cluster daemons, storage backends integrate tightly with the cluster configuration.

    • ZFS — local storage with built-in RAID and replication
    • LVM/LVM-thin — fast block storage on local or SAN disks
    • Ceph — distributed, cluster-wide storage with redundancy
    • NFS/iSCSI — external shared storage for HA

    Storage definitions live inside /etc/pve/storage.cfg, which pmxcfs replicates to all nodes.

    pvecm is the command-line tool used to manage cluster membership and quorum state.

    Some frequently used commands are as follows –

    • pvecm create <name>     – Create a new cluster
    • pvecm add <IP>          – Join an existing cluster (run on the new node, using the IP of a current member)
    • pvecm nodes             – List all nodes
    • pvecm status            – Show quorum and ring status

    Since I am using nested virtualization for this test, this is how I will be setting up the environment –

    • Host 1 –
      • VM Name – pxmx01
      • FQDN – pxmx01.minerva.lan (ease of accessing outside of lab environment)
      • IP Address – 10.0.26.3/25
      • Disks – 3 (Total – 650GB)
        • Disk 0 (Boot Disk) – 50GB
        • Disk 1 (ZFS Mirror Member#1) – 300GB
        • Disk 2 (ZFS Mirror Member#2) – 300GB
      • RAM – 16GB
      • vCPU – 4
    • Host 2 –
      • VM Name – pxmx02
      • FQDN – pxmx02.minerva.lan (ease of accessing outside of lab environment)
      • IP Address – 10.0.26.4/25
      • Disks – 3 (Total – 650GB)
        • Disk 0 (Boot Disk) – 50GB
        • Disk 1 (ZFS Mirror Member#1) – 300GB
        • Disk 2 (ZFS Mirror Member#2) – 300GB
      • RAM – 16GB
      • vCPU – 4

    Let’s get rolling!

     I created the nodes in my environment according to the specs above.

    Now, let’s fire them up and install Proxmox.

    I will be selecting the 50GB disk as my boot drive.

    Fill in the locale & set a secure root password.

    I am assigning the static IP Address per my plan here.

    Confirm the submitted settings in the ‘Summary’ page and press ‘Install’ to kick off the installation process.

    Once installation is complete on both nodes, let’s access the GUI of each node.

    https://pxmx01.minerva.lan:8006

    https://pxmx02.minerva.lan:8006

    Now that both boot disks are installed with Proxmox and the nodes are accessible via their FQDNs, let’s set up our mirrored pools.

    Let’s check the disk names and their metadata by running the command
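
    For example – lsblk is one common way to list the disks, their sizes, and their mountpoints:

    lsblk -o NAME,SIZE,TYPE,MOUNTPOINT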

    • sda is my boot/OS drive (mounted at /).
    • sdb and sdc are unused and show no mountpoint which means they are safe for ZFS.

    Now on each node, let’s run the following commands to create the mirrored pool –
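
    For example, assuming sdb and sdc are the two blank 300GB disks identified above (adjust the device names to your own output):

    zpool create -o ashift=12 vmdata mirror /dev/sdb /dev/sdc   # ashift=12 suits 4K-sector disks
    zpool status vmdata                                         # both members should show ONLINE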

    Then, we will add the mirrored pool to Proxmox.
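
    For example, from the shell (the GUI path Datacenter → Storage → Add → ZFS works just as well):

    pvesm add zfspool vmdata --pool vmdata --content images,rootdir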

    Once completed, we should be able to see a mirrored disk ‘vmdata’ on the left navigation pane.

    Now, let’s create a cluster by running the following command –

    In my environment, I will be naming the cluster – pxmx-cluster01 and I would like to initiate the cluster from pxmx01.
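
    On pxmx01, that translates to –

    pvecm create pxmx-cluster01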

    To add pxmx02 to the same cluster, let’s execute the command
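
    On pxmx02, pointing at pxmx01’s IP address –

    pvecm add 10.0.26.3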

    So now, regardless of any node we access, it will show the same two nodes – pxmx01 & pxmx02.

    Cluster Quorum Status
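
    From the shell, the same information is available on either node via –

    pvecm status   # shows cluster name, quorum votes, and membership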

    Our ZFS pool is looking good and healthy as well.

    Now, let’s create and configure the QDevice.

    This is what Proxmox’s Official Documentation says

    “Using a QDevice allows two-node clusters to maintain quorum by adding a third vote on an external host running the qnetd daemon.”
    “This host can be a small low-powered machine or VM running Debian.”

    As this can be a tiny device or VM running a stable version of Debian, I will be creating a small VM in my host environment with the following specs –

    • VM Name – pxmx-qdevice01
    • FQDN – Not required
    • IP Address – 10.0.26.5/25
    • Disks – 1 (Total – 8GB)
    • RAM – 1GB

    Proxmox also recommends a stable version of Debian. At the time of writing this article, Debian 13.1 (Trixie) is the stable version. So, I will choose that for my QDevice.

    Download link – https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/

    Let it Rip!

    For this small machine, we will just need

    • SSH Server
    • Standard System Utilities

    Once installation is complete, let’s log in to the VM console to find out the IP address it received.
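
    For example –

    ip -4 addr show   # note the address handed out by DHCP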

    Let’s SSH to the IP address (10.0.26.87)

    Let’s switch to the root user by typing ‘su -’.

    And run apt update to refresh the package lists.

    Once updated, let’s install the corosync-qnetd package so this VM can act as the QDevice host.
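
    Per the official docs, the external host runs the qnetd daemon, so the package needed here is corosync-qnetd –

    apt install corosync-qnetd -y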

    Let’s check the status of the QDevice to ensure the corosync-qnetd service is up and running.
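
    A quick check –

    systemctl status corosync-qnetd   # should report active (running)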

    We are in business!

    Now, let’s register the QDevice’s IP address with the cluster. We can run this from either node (as both nodes are in the cluster).
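
    With 10.0.26.87 being the QDevice’s current address, the command (run on either node) is –

    pvecm qdevice setup 10.0.26.87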

    The root password for the QDevice is being rejected as incorrect.

    Researching this online, I found that the error shown here can be misleading.

    This exact error happens when Debian’s root SSH login is disabled by default, even though the root user exists.
    So the password is correct — SSH is simply refusing to let root log in.

    This is normal for Debian 12/13. We just need to enable root SSH login temporarily.

    Let’s change the SSH config (/etc/ssh/sshd_config) inside our QDevice.

    Uncomment the PermitRootLogin line and replace ‘prohibit-password’ with ‘yes’.
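
    The line in question goes from this –

    #PermitRootLogin prohibit-password

    – to this –

    PermitRootLogin yes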

    Once changed, save the file and exit.
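
    For the new setting to take effect, restart the SSH service on the QDevice –

    systemctl restart ssh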

    OK, I am able to log in, but it looks like the system is not able to generate the certificates.

    bash: line 1: corosync-qdevice-net-certutil: command not found

    Researching online for a bit, I found that this binary ships in the corosync-qdevice package, which needs to be installed on the Proxmox nodes themselves – not on the QDevice VM. Without it, pvecm qdevice setup can’t generate the certificates.

    Let’s install them on both nodes.
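
    On both pxmx01 and pxmx02 –

    apt install corosync-qdevice -y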

    We are set!

    Post this, I was able to add the QDevice successfully to the two Proxmox nodes and once added, the quorum vote count went to 3.

    I also want to ensure that the QDevice doesn’t lose its IP as this will be catastrophic for the cluster during node failure scenarios. So, I will set the IP Address (10.0.26.87) as persistent.

    Let’s get into the NIC settings in our QDevice (on a standard Debian install these live in /etc/network/interfaces) –

    Update the file with this –
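
    A minimal static stanza looks like this (interface name ens33 per the VM; the gateway shown is an assumption for my 10.0.26.0/25 lab network – adjust to yours):

    auto ens33
    iface ens33 inet static
        address 10.0.26.87/25
        # gateway below is an assumption for this lab subnet – use your own
        gateway 10.0.26.1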

    Then save & exit.

    Now, let’s disable DHCP for the ens33 interface –
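
    In practice this just means the installer-generated DHCP line for ens33 should no longer be active – either delete it or comment it out:

    # iface ens33 inet dhcp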

    Let’s restart the network services for the changes to take effect –
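
    On Debian with ifupdown, that is –

    systemctl restart networking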

    Let’s verify if it retained our desired IP Address post restart –
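
    For example –

    ip -4 addr show ens33   # should now show 10.0.26.87/25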

    It did.

    Let’s check the cluster status as well.

    And the QDevice connectivity is still intact. Love it!

    Now that our cluster is set up, let’s configure ZFS & replication.

    What we know so far –

    • Nodes: pxmx01 and pxmx02
    • Each node has one ZFS mirror pool on Disk1+2
    • ZFS Pool name: vmdata
    • Goal: replicate VMs from pxmx01 → pxmx02

    So here is what I am planning to do with my setup –

    • I will have a Windows 2022 test VM running on pxmx01 (VM100)
    • VM100 will be stored in VM dataset – vmdata/vmstore on each node.
    • VM100 replicates to pxmx02 on 15-min intervals.
    • Test manual failover of VM100 from pxmx01 → pxmx02.
    • Test manual failback of VM100 from pxmx02 → pxmx01.

    You might ask – “Do I really need a separate dataset like vmdata/vmstore, instead of putting VMs directly on the pool root?”

    Short answer – no, it isn’t mandatory.

    But it is one of the best practices recommended by Proxmox, and there are valid reasons behind it –

    • Snapshot cleanliness – If you store VMs directly in vmdata, ZFS snapshots apply to everything in the pool root.
    • Easier replication targeting – Proxmox replication jobs send datasets; replicating from the pool root can pull along more than expected, including unneeded datasets.
    • Per-dataset tuning – compression and other ZFS properties can be changed independently for each dataset.
    • Easy separation of roles – vmdata/vmstore for VMs, vmdata/iso for ISOs, etc.

    This exercise will validate our 2-node build with controlled DR scenarios.

    Let’s create the VM Dataset on both nodes
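
    On each node –

    zfs create vmdata/vmstore
    zfs list -r vmdata   # confirm the new dataset exists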

    Now, let’s add ZFS storage pool in Proxmox.

    In the web UI:

    Go to Datacenter → Storage → Add → ZFS

    Fill:

    • ID: vmstore
    • ZFS Pool: vmdata/vmstore
    • Nodes: pxmx01, pxmx02
    • Content: check Disk image and Container

    Click Add
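
    The equivalent CLI (run once on either node, since storage definitions are cluster-wide) would be something like –

    pvesm add zfspool vmstore --pool vmdata/vmstore --content images,rootdir --nodes pxmx01,pxmx02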

    Now both nodes have local ZFS storage defined.

    For our testing, I went ahead and spun up the two test VMs – VM100 in pxmx01 & VM101 in pxmx02.

    Navigate to Datacenter → pxmx01 → VM100 → Replication → Add, and create a replication job targeting pxmx02 on a 15-minute schedule.
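
    The equivalent from the pxmx01 shell, with 100-0 being the conventional <vmid>-<job-number> ID –

    pvesr create-local-job 100-0 pxmx02 --schedule "*/15"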

    After 15 minutes, we should be able to see a successful replication run for both machines.

    We can also cross-replicate VMs the other way (if we want). So, I added a replication job for VM101 to be replicated from pxmx02 → pxmx01.

    Since we are not in an actual disaster and both nodes are healthy, the safest way to “test failover” for VM100 is to do an offline migration of the VM from pxmx01 to pxmx02.

    Because we are using ZFS storage replication, Proxmox will:

    • Reuse the already-replicated disks on pxmx02, and
    • Automatically flip the replication direction afterward (now from pxmx02 → pxmx01).

    Step 1 – Shut down the VM100 from GUI

    Step 2 – With VM100 highlighted, click ‘Migrate’ in the upper right.
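
    The CLI equivalent, run from the pxmx01 shell, is an offline migration –

    qm migrate 100 pxmx02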

    We can check the status of the job by selecting it in the task list and clicking ‘Logs’.

    The VM has migrated successfully.

    Step 3 – Power on VM100 on host pxmx02

    The IP settings were retained during migration.

    The replication job (associated with the VM) also got migrated and this is exactly what we want.

    Step 1 – Shut down VM100, which is now running on pxmx02.

    Step 2 – Migrate VM100 to pxmx01.

    VM100 is back in pxmx01 again.

    Step 3 – Start VM100, which is now back on pxmx01.

    And, we are back. Though this is a very lightweight test machine, the failover and failback times are incredibly fast.

    This is about as much automation as we can achieve with this setup.

    My current environment uses:

    • Local ZFS storage on each node (vmstore dataset on each host)
    • ZFS replication → not shared storage
    • Nested virtualization through VMware Workstation
    • Two-node Proxmox cluster with QDevice quorum

    With this architecture, here’s the hard truth:

    ZFS replication is asynchronous, not shared.
    Each node has its own copy of the VM disks, but they are not the same dataset.

    This means:

    • HA auto-restart cannot occur because disks are not instantly shared.
    • I had to manually Stop → Migrate → start the VM.
    • Maximum automation = scheduled replication + offline migration failover.

    This is the limit of what ZFS replication can offer with this kind of setup.

    To get real HA (VMs auto-starting on another node when one fails), you need shared, cluster-visible storage that is not tied to a single node, for example:

    • NFS Server (simple, low-cost lab HA)
      • Central NFS server hosts VM disks
      • Both Proxmox nodes mount the NFS share
      • Storage becomes shared
      • HA can automatically restart VMs
    • Ceph (enterprise-grade, fully distributed, true HA)
      • Shared block storage (RBD)
      • Immediate disk access from any node
      • Automatic VM recovery
      • Live migration
      • Fault tolerance and replication built into Ceph

    This is the real HA stack used in production Proxmox clusters.

    ZFS replication provides low-RPO and surprisingly fast RTO when tuned properly.
    Minute-level RPO is achievable using frequent incremental sends, and RTO is minimized when using dedicated multi-gigabit links (ideally aggregated) for replication traffic.

    While this remains a manual failover process, aggressive replication and clean network design allow ZFS to function as a lightweight, low-latency DR mechanism even without shared storage, making it a very promising storage option for Proxmox VE.

    Stay tuned for future posts on this topic as I fiddle with other variations & builds.

