High throughput Site to site VPN on commodity hardware - an adventure with Wireguard, bonding and ECMP
At work, I recently had an interesting challenge - we needed a high throughput site to site VPN between two of our co-located DCs. The existing one was not keeping up with the bandwidth demand, which was increasing every week. Normally I'd opt for the tried and tested IPsec tunnels using strongswan. One of my mentors had done this in the past - here is his own write-up. But I wanted to do something different this time. Enter Wireguard, the newest kid on the VPN block. What follows is a brief write-up of the attempts to push a decent amount of traffic over Wireguard - enough to meet our needs.
Here comes the 'interesting' part of the challenge - one side has only a 1G LAN. The final ISP uplink is 10G, but the server is connected to a 1G switch. We need to push more than 1Gbps over the tunnel.
Here is how it looks:
+---------------------+                  +------------------+
|                     |    Wireguard     |                  |
|  Site A - 1G LAN    +<---------------->+  Site B - 10G LAN|
|  10G Uplink         |                  |  10G Uplink      |
|  10.2.3.4/24        |                  |  10.5.6.7/24     |
+---------------------+                  +------------------+
First, we need to solve the 1G bottleneck on the LAN side - if the site A host can't receive more than 1G, our 'more than 1Gbps' tunnel will never work. To solve this, we choose bonding - specifically LACP. We bond 3 ports; the switch-side bond needs to be configured in LACP 802.3ad mode.
On the Linux side, this is what we do:
ip link add bond0 type bond
ip link set bond0 type bond miimon 100 mode 802.3ad
ip link set eno1 master bond0
ip link set eno2 master bond0
ip link set eno3 master bond0
ip link set bond0 up
ip a add 10.2.3.4/24 dev bond0
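Before testing, a quick sanity check (an extra step, not part of the original procedure) is to look at the bonding driver's view of things and confirm that the bond negotiated 802.3ad and that all three slaves joined it:

admin@10.2.3.4:~# cat /proc/net/bonding/bond0

Look for 'Bonding Mode: IEEE 802.3ad Dynamic link aggregation' near the top, and one 'Slave Interface' section per port, each with 'MII Status: up'.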
Okay, let's test this out. We bond 3 interfaces on another host similarly and fire up iperf3 with 2 connections.

admin@10.2.3.3:~$ iperf3 -c 10.2.3.4 -P 2
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-4.81  sec   272 MBytes   474 Mbits/sec    0    sender
[  4]   0.00-4.81  sec  0.00 Bytes    0.00 bits/sec         receiver
[  6]   0.00-4.81  sec   272 MBytes   473 Mbits/sec    0    sender
[  6]   0.00-4.81  sec  0.00 Bytes    0.00 bits/sec         receiver
[SUM]   0.00-4.81  sec   543 MBytes   947 Mbits/sec    0    sender
[SUM]   0.00-4.81  sec  0.00 Bytes    0.00 bits/sec         receiver
Well, that's a bit anti-climactic! We bonded the interfaces, so shouldn't we be able to push above 1Gbps, if not the entire 3 Gbps? Not quite.
By default, LACP bonding in Linux uses a layer 2 hash of the MAC addresses to distribute traffic between the bonded interfaces - which means all traffic between a given pair of hosts flows over only one interface, a 1G interface. We need a way to change this: in our case, we need to distribute traffic between the same pair of hosts across multiple interfaces. So we tell the kernel just that:
admin@10.2.3.4:~# cat /sys/class/net/bond0/bonding/xmit_hash_policy
layer2 0
admin@10.2.3.4:~# echo 1 > /sys/class/net/bond0/bonding/xmit_hash_policy
admin@10.2.3.4:~# cat /sys/class/net/bond0/bonding/xmit_hash_policy
layer3+4 1
Here is what the Kernel documentation says:
layer3+4

  This policy uses upper layer protocol information, when available, to
  generate the hash. This allows for traffic to a particular network peer
  to span multiple slaves, although a single connection will not span
  multiple slaves.
One thing to note is that this mode is not fully 802.3ad compliant - packets may arrive out of order. But for our use-case that is fine, since the intended application is not too sensitive to out-of-order arrivals.
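One practical aside: the echo into sysfs above does not survive a reboot. How to persist it depends on the distro; as one example, assuming a Debian/Ubuntu host using ifupdown with the ifenslave package (an assumption for illustration - not necessarily what this setup used), the whole bond can be declared in /etc/network/interfaces:

# /etc/network/interfaces - hypothetical persistent version of the bond above
auto bond0
iface bond0 inet static
    address 10.2.3.4
    netmask 255.255.255.0
    bond-slaves eno1 eno2 eno3
    bond-mode 802.3ad
    bond-miimon 100
    bond-xmit-hash-policy layer3+4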
Let's try again:
admin@10.2.3.3:~$ iperf3 -c 10.2.3.4 -P 2
[ ID] Interval           Transfer     Bandwidth       Retr
<snip>
[SUM]   0.00-7.35  sec  1.61 GBytes   1.89 Gbits/sec    0    sender
[SUM]   0.00-7.35  sec  0.00 Bytes    0.00 bits/sec     0    receiver
Now it's working. With that out of the way, we proceed to the Wireguard bit.
For installation and configuration, this Linode write-up is a very good starting point.
We will skip over the installation bits and jump straight to configuring it:
admin@10.2.3.4:~# cat /etc/wireguard/wg0.conf
[Interface]
PrivateKey = <local_priv_key>
ListenPort = 5000
Address = 10.255.254.2/28
[Peer]
PublicKey = <peer's pub key>
AllowedIPs = 10.255.254.0/28,10.5.6.0/24
PersistentKeepalive = 10
Here 10.5.6.0/24 is site B's LAN subnet, as indicated in the network diagram.
The addresses used for the Wireguard interfaces don't really matter as long as they don't collide with existing addresses on the server.
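For reference, the <local_priv_key> and <peer's pub key> placeholders are ordinary Wireguard key pairs, generated on each side the same way the upstream quickstart does it (nothing here is specific to this setup):

umask 077
wg genkey | tee privatekey | wg pubkey > publickey
# the contents of 'privatekey' go into the local [Interface] section;
# 'publickey' is handed to the other site for its [Peer] section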
Similarly on Site B:
[Interface]
PrivateKey = <local_priv_key>
ListenPort = 5000
Address = 10.255.254.1/28
[Peer]
PublicKey = <peer's pub key>
AllowedIPs = 10.255.254.0/28,10.2.3.0/24
Endpoint = 45.67.89.1:5000
PersistentKeepalive = 10
(45.67.89.1 is a placeholder for site A's public IP - the Endpoint under [Peer] is the address at which the peer, site A, can be reached)
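With both configs in place, each side brings its tunnel up with wg-quick - either via the wg-quick@ systemd unit so it comes back after a reboot (this is the unit the systemctl output later in this post refers to):

admin@10.2.3.4:~# systemctl enable --now wg-quick@wg0

or, for a one-off manual test:

admin@10.2.3.4:~# wg-quick up wg0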
Once the tunnels are up on both sides, here is what it looks like on Site A:
admin@10.2.3.4:~# wg
interface: wg0
  public key: <scrubbed>
  private key: (hidden)
  listening port: 5000

peer: <scrubbed>
  endpoint: <scrubbed>:5000
  allowed ips: 10.255.254.0/28, 10.5.6.0/24
  latest handshake: 1 minute, 36 seconds ago
  transfer: 25.16 GiB received, 13.37 GiB sent
  persistent keepalive: every 10 seconds

Now let's add routes on the servers on both sites so that traffic for the other site goes through the local wg peer.
On a server on site A that wants to reach site B, it would look like this:
admin@10.2.3.6:~# ip route add 10.5.6.0/24 via 10.2.3.4

Similarly, add routes on site B as well.
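As a quick check that the route took effect (an extra step, not part of the original procedure), you can ask the kernel which path it would pick for a site B address - it should answer via 10.2.3.4:

admin@10.2.3.6:~# ip route get 10.5.6.7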
Now, onward to iperf3 to test our shiny new tunnel.
From site A
admin@10.2.3.6:~# iperf3 -c 10.5.6.7 -P 10
[ ID] Interval Transfer Bandwidth Retr Cwnd <snip>
[SUM] 0.00-10.00 sec 1.04 GBytes 893 Mbits/sec 432 sender
[SUM] 0.00-10.00 sec 1.02 GBytes 874 Mbits/sec receiver
Close to 1G line rate (as close as we can expect for encrypted traffic) - but it is still below 1Gbps. How do we push this beyond 1Gbps?
Enter ECMP - Equal Cost Multi-path Routing. Linux has had this ability for a long time though it is not very widely used. The folks at Cumulus Networks have an excellent write up here if you're curious about the evolution of ECMP in Linux.
The rough idea is that if we have multiple paths each capable of pushing 1Gbps, then we can distribute and route the packets over those paths (which will be wg tunnels in our case). But first, we need the multiple paths.
We simply have to add one more config at /etc/wireguard/wg1.conf - similar to wg0 - except with the ports incremented on both sides, so that the LACP bonding hash on site A picks a different interface after hashing on the port number. The port change is also needed because Wireguard listens on all interfaces by default and there is no way to change this - so changing only the IP won't work.
Well, once we add the config, let's try to bring up the wg1 tunnel:
admin@10.2.3.4:~# cat /etc/wireguard/wg1.conf
[Interface]
PrivateKey = <local_priv_key>
ListenPort = 5001
Address = 10.255.253.2/28
[Peer]
PublicKey = <peer's pub key>
AllowedIPs = 10.255.253.0/28,10.5.6.0/24
PersistentKeepalive = 10
Note the change in the Address parameter under [Interface] - we do this to avoid a collision with the previous wg0 tunnel. However, it fails to come up:
admin@10.2.3.4:~# systemctl status wg-quick@wg1
<snip>
[#] ip link add wg1 type wireguard
[#] wg setconf wg1 /dev/fd/63
[#] ip -4 address add 10.255.253.2/28 dev wg1
[#] ip link set mtu 1420 up dev wg1
[#] ip -4 route add 10.5.6.0/24 dev wg1
RTNETLINK answers: File exists
<snip>

One more road-block. Since wg0 already exists and has already inserted the route for site B into the routing table, the second tunnel fails to come up. Not to fret - Wireguard provides a neat little parameter called Table. Set it to off, and Wireguard will not add routes. Neat, huh?
admin@10.2.3.4:~# cat /etc/wireguard/wg1.conf
[Interface]
PrivateKey = <local_priv_key>
ListenPort = 5001
Address = 10.255.253.2/28
Table = off
Now, onwards to the pièce de résistance - ECMP. We add ECMP route on the site A wg peer as below:
admin@10.2.3.4:~# ip route add 10.5.6.0/24 \
    nexthop dev wg0 weight 1 \
    nexthop dev wg1 weight 1
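One kernel-dependent caveat worth an aside (not something the original setup shows being tuned): Linux ECMP hashes per flow, and older kernels hash only on source and destination IP, so every flow between the same two hosts can land on a single tunnel. On kernels around 4.12 and newer you can opt into including layer 4 ports in the hash:

# include L4 ports in the IPv4 multipath hash (check that your kernel has this sysctl)
admin@10.2.3.4:~# sysctl -w net.ipv4.fib_multipath_hash_policy=1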
Add a similar ECMP route on the other side as well. Now, iperf3:
admin@10.2.3.6:~# iperf3 -c 10.5.6.7 -P 10
[ ID] Interval Transfer Bandwidth Retr Cwnd<snip>
[SUM] 0.00-10.00 sec 1.89 GBytes 1.62 Gbits/sec 119501 sender
[SUM] 0.00-10.00 sec 1.82 GBytes 1.56 Gbits/sec receiver
Voila!
Now we can add one more Wireguard path and scale it up even more.
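As a rough sketch of what that would look like - wg2 here being a hypothetical third tunnel on yet another port - the ECMP route simply grows one more nexthop:

admin@10.2.3.4:~# ip route replace 10.5.6.0/24 \
    nexthop dev wg0 weight 1 \
    nexthop dev wg1 weight 1 \
    nexthop dev wg2 weight 1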
Feel free to reach out at muthu dot raj at outlook dot com for feedback or questions.
Thanks @santhikrishna for proof-reading this.
References:
1. Kalyan's blog post on ECMP + Strongswan - https://blogs.eskratch.com/2018/10/how-we-have-systematically-improved.html
2. Wireguard quickstart - https://www.wireguard.com/quickstart/
3. Linode write-up on setting up Wireguard - https://www.linode.com/docs/networking/vpn/set-up-wireguard-vpn-on-ubuntu/
4. Cumulus Networks' write-up on ECMP in Linux - https://cumulusnetworks.com/blog/celebrating-ecmp-part-one/