High-throughput site-to-site VPN on commodity hardware - an adventure with WireGuard, bonding and ECMP

At work, I recently had an interesting challenge - we needed a high-throughput site-to-site VPN between two of our co-located DCs. The existing one was not keeping up with a demand for bandwidth that grew every week. Normally I'd opt for tried and tested IPsec tunnels using strongSwan. One of my mentors had done this in the past - here is his own write-up. But I wanted to do something different this time. Enter WireGuard, the newest kid on the VPN block. What follows is a brief write-up of my attempts to push a decent amount of traffic over WG - enough to meet our needs.

Here comes the 'interesting' part of the challenge - one side only has a 1G LAN. The ISP uplink there is 10G, but the server is only connected to a 1G switch, and we need to push more than 1Gbps over the tunnel.
Here is how it looks:

+---------------------+                +---------------------+
|                     |   WireGuard    |                     |
|   Site A - 1G LAN   +--------------->+  Site B - 10G LAN   |
|     10G Uplink      +<---------------+     10G Uplink      |
|    10.2.3.4/24      |                |    10.5.6.7/24      |
+---------------------+                +---------------------+

First, we need to solve the 1G bottleneck on the LAN side - if the site A host can't receive more than 1G, our 'more than 1Gbps' tunnel will never work. To solve this, we choose bonding - specifically, LACP. We bond 3 ports, and the matching ports on the switch need to be configured as an LACP (802.3ad) bond as well.
On the Linux side, this is what we do:

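# create the bond device (the link type is "bond")
ip link add bond0 type bond
# LACP mode, checking link state every 100 ms
ip link set bond0 type bond miimon 100 mode 802.3ad
# enslave the three 1G ports
ip link set eno1 master bond0
ip link set eno2 master bond0
ip link set eno3 master bond0
ip link set bond0 up
ip a add 10.2.3.4/24 dev bond0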
Okay, let's test this out. We bond 3 interfaces on another host in the same way and fire up iperf3 with 2 parallel connections.

admin@10.2.3.3:~# iperf3 -c 10.2.3.4 -P 2
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-4.81   sec   272 MBytes   474 Mbits/sec    0 sender
[  4]   0.00-4.81   sec  0.00 Bytes  0.00 bits/sec        receiver
[  6]   0.00-4.81   sec   272 MBytes   473 Mbits/sec    0 sender
[  6]   0.00-4.81   sec  0.00 Bytes  0.00 bits/sec        receiver
[SUM]   0.00-4.81   sec   543 MBytes   947 Mbits/sec    0 sender
[SUM]   0.00-4.81   sec  0.00 Bytes  0.00 bits/sec        receiver

Well, that's a bit anti-climactic! Didn't we bond the interfaces, so shouldn't we be able to push above 1Gbps, if not the entire 3Gbps? Not quite.
By default, bonding in Linux uses a layer2 hash of the MAC addresses to pick the outgoing interface. All traffic between the same two hosts therefore hashes to the same interface - which is a 1G interface. We need the bond to spread traffic between a pair of hosts across multiple interfaces, so we tell the kernel to hash on layer 3 and layer 4 information instead:

admin@10.2.3.4:~# cat /sys/class/net/bond0/bonding/xmit_hash_policy
layer2 0
admin@10.2.3.4:~# echo 1 > /sys/class/net/bond0/bonding/xmit_hash_policy
admin@10.2.3.4:~# cat /sys/class/net/bond0/bonding/xmit_hash_policy
layer3+4 1

Here is what the kernel documentation says:
layer3+4
This policy uses upper layer protocol information, when available, to generate the hash. This allows for traffic to a particular network peer to span multiple slaves, although a single connection will not span multiple slaves.

One thing to note is that this policy is not fully 802.3ad compliant - a conversation that mixes fragmented and unfragmented packets can be striped across interfaces, so packets may arrive out of order. But for our use-case it is fine, since the intended application is not too sensitive to out-of-order arrivals.
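
To see why the hash policy matters, here is a toy model of the two policies in Python. It loosely follows the formulas given in the kernel's bonding documentation, but it is only a sketch: real kernels differ in the details of the hash, and the MAC addresses and source ports below are made up for illustration.

```python
import ipaddress


def layer2_slave(src_mac: str, dst_mac: str, slave_count: int) -> int:
    """Toy layer2 policy: XOR the two MAC addresses, modulo slave count."""
    src = int(src_mac.replace(":", ""), 16)
    dst = int(dst_mac.replace(":", ""), 16)
    return (src ^ dst) % slave_count


def layer34_slave(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
                  slave_count: int) -> int:
    """Toy layer3+4 policy: mix ports and IP addresses, modulo slave count."""
    h = (src_port << 16) | dst_port
    h ^= int(ipaddress.IPv4Address(src_ip)) ^ int(ipaddress.IPv4Address(dst_ip))
    h ^= h >> 16
    h ^= h >> 8
    return h % slave_count


# Hypothetical MAC addresses for the two bonded hosts.
MAC_A = "52:54:00:00:00:03"
MAC_B = "52:54:00:00:00:04"

# Several connections between the same two hosts. The source ports are
# made up; 5201 is iperf3's default server port.
for src_port in range(40000, 40004):
    l2 = layer2_slave(MAC_A, MAC_B, 3)
    l34 = layer34_slave("10.2.3.3", "10.2.3.4", src_port, 5201, 3)
    print(f"src port {src_port}: layer2 -> slave {l2}, layer3+4 -> slave {l34}")
```

With layer2, every connection between the two hosts maps to the same slave, because the MAC addresses never change. With layer3+4 the source port enters the hash, so different connections can land on different slaves, while any single connection still stays on one.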

Let's try again:
admin@10.2.3.3:~# iperf3 -c 10.2.3.4 -P 2
[ ID] Interval           Transfer     Bandwidth       Retr
<snip>
[SUM]   0.00-7.35   sec  1.61 GBytes  1.89 Gbits/sec    0            sender
[SUM]   0.00-7.35   sec  0.00 Bytes  0.00 bits/sec      0          receiver

Now it's working. With that out of the way, we proceed to the WireGuard bit.

For installation and configuration, this Linode write-up is a very good starting point.

We will skip over the installation bits and jump straight to configuring it:

admin@10.2.3.4:~# cat /etc/wireguard/wg0.conf
[Interface]
PrivateKey = <local_priv_key>
ListenPort = 5000
Address = 10.255.254.2/28

[Peer]
PublicKey = <peer's pub key>
AllowedIPs = 10.255.254.0/28,10.5.6.0/24
PersistentKeepalive = 10

Here 10.5.6.0/24 is site B's LAN subnet, as indicated in the network diagram.
The addresses used for the WireGuard interfaces don't really matter as long as they don't collide with existing addresses on the servers.

Similarly on Site B:

[Interface]
PrivateKey = <local_priv_key>
ListenPort = 5000
Address = 10.255.254.1/28

[Peer]
PublicKey = <peer's pub key>
AllowedIPs = 10.255.254.0/28,10.2.3.0/24
Endpoint = 45.67.89.1:5000
PersistentKeepalive = 10

(45.67.89.1 is a placeholder for site A's public IP - the Endpoint is the address of the peer we connect to, and from site B the peer is site A)


Bring up the tunnels using wg-quick, and here is what it looks like on Site A:

admin@10.2.3.4:~# wg
interface: wg0
  public key: <scrubbed>
  private key: (hidden)
  listening port: 5000
peer: <scrubbed>
  endpoint: 5000
  allowed ips: 10.255.254.0/28, 10.5.6.0/24
  latest handshake: 1 minute, 36 seconds ago
  transfer: 25.16 GiB received, 13.37 GiB sent
  persistent keepalive: every 10 seconds
Now let's add routes on the servers at both sites so that traffic for the remote site goes through the local wg peer.

On a site A server that wants to reach site B, the route looks like this:

admin@10.2.3.6:~# ip route add 10.5.6.0/24 via 10.2.3.4
Similarly add routes on site B as well.
Now, onward to iperf3 to test our shiny new tunnel.

From site A
admin@10.2.3.6:~# iperf3 -c 10.5.6.7 -P 10
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd <snip>
[SUM]   0.00-10.00  sec  1.04 GBytes   893 Mbits/sec  432    sender
[SUM]   0.00-10.00  sec  1.02 GBytes   874 Mbits/sec        receiver
Close to 1G line rate (as close as we can expect for encrypted traffic) - but it is still below 1Gbps. How do we push this beyond 1Gbps?

Enter ECMP - Equal Cost Multi-path Routing. Linux has had this ability for a long time though it is not very widely used. The folks at Cumulus Networks have an excellent write up here if you're curious about the evolution of ECMP in Linux. 
The rough idea is that if we have multiple paths each capable of pushing 1Gbps, then we can distribute and route the packets over those paths (which will be wg tunnels in our case). But first, we need the multiple paths.
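
Before building the extra paths, it helps to be clear about what 'distribute' means here. As far as we know, reasonably recent kernels pick the next hop per flow by hashing packet headers, and the net.ipv4.fib_multipath_hash_policy sysctl (on kernels that have it) selects whether only the IP addresses (0) or also the ports (1) go into the hash. The Python sketch below is a toy model of that idea, not the kernel's algorithm: CRC32 stands in for the real hash, and the flows are made up.

```python
import zlib


def pick_nexthop(flow, nexthops, use_ports):
    """Pick one next hop for a flow; the same flow always gets the same hop."""
    src_ip, dst_ip, src_port, dst_port = flow
    key = (src_ip, dst_ip, src_port, dst_port) if use_ports else (src_ip, dst_ip)
    digest = zlib.crc32(repr(key).encode())  # stand-in for the kernel's hash
    return nexthops[digest % len(nexthops)]


nexthops = ["wg0", "wg1"]
# Ten made-up iperf3-style connections between one pair of hosts.
flows = [("10.2.3.6", "10.5.6.7", 40000 + i, 5201) for i in range(10)]

for use_ports in (False, True):
    used = {}
    for flow in flows:
        hop = pick_nexthop(flow, nexthops, use_ports)
        used[hop] = used.get(hop, 0) + 1
    label = "hash on addresses and ports:" if use_ports else "hash on addresses only:"
    print(label, used)
```

When only the addresses are hashed, all ten connections between the same two hosts share one tunnel; including the ports gives them a chance to spread out. Either way, a single connection never gets more than one path can carry.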

We simply have to add one more config at /etc/wireguard/wg1.conf - similar to wg0 - except that we increment the ports on both sides, so that the bond's layer3+4 hash on site A can pick a different interface after hashing on the port number. The port change is also needed because WireGuard listens on all interfaces by default and there is no way to change this - so changing only the IP won't work.

Well, once we add the config, let's try to bring up the wg1 tunnel:

admin@10.2.3.4:~# cat /etc/wireguard/wg1.conf
[Interface]
PrivateKey = <local_priv_key>
ListenPort = 5001
Address = 10.255.253.2/28

[Peer]
PublicKey = <peer's pub key>
AllowedIPs = 10.255.253.0/28,10.5.6.0/24
PersistentKeepalive = 10

Note the change in the Address parameter under Interface - we do this to avoid a collision with the existing wg0 tunnel. However, it fails to come up:
admin@10.2.3.4:~# systemctl status wg-quick@wg1
<snip>
: [#] ip link add wg1 type wireguard
: [#] wg setconf wg1 /dev/fd/63
: [#] ip -4 address add 10.255.253.2/28 dev wg1
: [#] ip link set mtu 1420 up dev wg1
: [#] ip -4 route add 10.5.6.0/24 dev wg1
: RTNETLINK answers: File exists
<snip>
One more road-block. Since wg0 already exists and has inserted a route for site B into the routing table, wg-quick cannot add the same route for the second tunnel, and it fails to come up. Not to fret. wg-quick provides a neat little parameter called Table. Set it to off, and wg-quick will not add routes. Neat, huh?

admin@10.2.3.4:~# cat /etc/wireguard/wg1.conf
[Interface]
PrivateKey = <local_priv_key>
ListenPort = 5001
Address = 10.255.253.2/28
Table = off
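
Since each additional tunnel differs from wg0 only in its port and tunnel subnet, the configs can be generated. Below is a minimal Python sketch, assuming the numbering used so far simply continues (port 5000 + n, tunnel subnet 10.255.(254 - n).0/28, host .2 on site A). The function name and its parameters are our own, and the key placeholders are left as they are, to be filled in by hand. Unlike wg0 above, the sketch sets Table = off on every tunnel by default, on the assumption that all routes are then added by hand.

```python
import ipaddress


def render_wg_conf(n: int, remote_lan: str = "10.5.6.0/24",
                   host: int = 2, table_off: bool = True) -> str:
    """Render a wg-quick config for tunnel wg<n> on site A."""
    tunnel_net = ipaddress.ip_network(f"10.255.{254 - n}.0/28")
    address = f"{tunnel_net.network_address + host}/{tunnel_net.prefixlen}"
    lines = [
        "[Interface]",
        "PrivateKey = <local_priv_key>",
        f"ListenPort = {5000 + n}",
        f"Address = {address}",
    ]
    if table_off:
        # Keep wg-quick from adding routes for AllowedIPs.
        lines.append("Table = off")
    lines += [
        "",
        "[Peer]",
        "PublicKey = <peer's pub key>",
        f"AllowedIPs = {tunnel_net},{remote_lan}",
        "PersistentKeepalive = 10",
    ]
    return "\n".join(lines) + "\n"


# Print the config for the second tunnel, wg1.
print(render_wg_conf(1))
```

The output for n = 1 can be compared line by line with the wg1.conf shown above before writing anything to /etc/wireguard.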
Now, onwards to the pièce de résistance - ECMP. We add an ECMP route on the site A wg peer as below:

admin@10.2.3.4:~# ip route add 10.5.6.0/24 nexthop dev wg0 weight 1 nexthop dev wg1 weight 1
Similarly on the other side. Now iperf3:

admin@10.2.3.6:~# iperf3 -c 10.5.6.7 -P 10
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
<snip>
[SUM]   0.00-10.00  sec  1.89 GBytes  1.62 Gbits/sec  119501             sender
[SUM]   0.00-10.00  sec  1.82 GBytes  1.56 Gbits/sec                  receiver
Voila!
Now we can add one more WireGuard path and scale it up even more.
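
With more tunnels, the multipath route simply gains one nexthop per tunnel. Here is a small Python sketch that builds the command string for any number of tunnels. The helper name is ours, wg2 does not exist yet in this write-up, and the sketch only prints the command rather than running it.

```python
def ecmp_route_cmd(dest, devices):
    """Build an equal-weight multipath 'ip route add' command string."""
    if not devices:
        raise ValueError("need at least one device")
    hops = " ".join(f"nexthop dev {dev} weight 1" for dev in devices)
    return f"ip route add {dest} {hops}"


# Three tunnels towards site B's LAN; wg2 is hypothetical at this point.
print(ecmp_route_cmd("10.5.6.0/24", ["wg0", "wg1", "wg2"]))
```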

Feel free to reach out at muthu dot raj at outlook dot com for feedback or questions.

Thanks @santhikrishna for proof-reading this.

References:
1. WireGuard quickstart - https://www.wireguard.com/quickstart/
2. Cumulus Networks' write-up on ECMP in Linux - https://cumulusnetworks.com/blog/celebrating-ecmp-part-one/

Comments

  1. man, this post is epic! yet I have a question: Would this approach be applicable with active/failover scenarios? In fact I'm looking for a solution to setup a HA cluster with OPNsense and make sure the Wireguard connection survives failovers. We're using just one server in the cloud but 2 gateways per site. How would this behave, setting up wg0 and wg1 pointing to the same subnets and configuring ECMP on the server? Yet traffic flows over only one gateway at a time.

    Your inputs would be highly appreciated.
