Thursday, 18 June 2020

High throughput Site to site VPN on commodity hardware - an adventure with Wireguard, bonding and ECMP

At work, I recently had an interesting challenge - we needed a high throughput site to site VPN between two of our co-located DCs. The existing one was not cutting it, with the demand for bandwidth increasing every week. Normally I'd opt for the tried and tested IPsec tunnels using strongSwan. One of my mentors had done this in the past - here is his own write-up. But I wanted to do something different this time. Enter Wireguard, the newest kid on the VPN block. What follows is a brief write-up of our attempts to push a decent amount of traffic over WG - enough to meet our needs.

Here comes the 'interesting' part of the challenge - one side has only a 1G LAN. The final ISP uplink is 10G, but the server is connected to a 1G switch. We need to push more than 1Gbps over the tunnel.
Here is how it looks:

+---------------------+                  +---------------------+
|                     |    Wireguard     |                     |
|   Site A - 1G LAN   |<---------------->|   Site B - 10G LAN  |
|      10G Uplink     |                  |      10G Uplink     |
|      10.2.3.4/24    |                  |      10.5.6.7/24    |
+---------------------+                  +---------------------+

First, we need to solve the 1G bottleneck on the LAN side - if the site A host can't receive more than 1G, our 'more than 1Gbps' tunnel will never work. To solve this, we turn to bonding - specifically, LACP. We bond 3 ports, and the switch-side bond needs to be configured in LACP 802.3ad mode.
On the Linux side, this is what we do:

# create the bond and put it in 802.3ad (LACP) mode
ip link add bond0 type bond
ip link set bond0 type bond miimon 100 mode 802.3ad
# enslave the three 1G ports (they must be down before they can be enslaved)
ip link set eno1 master bond0
ip link set eno2 master bond0
ip link set eno3 master bond0
ip link set bond0 up
ip a add 10.2.3.4/24 dev bond0
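Before testing, it's worth a quick sanity check that LACP actually negotiated - the bonding driver exposes its state under /proc:

# shows the aggregator info, active slaves and the current hash policy
cat /proc/net/bonding/bond0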
Okay, let's test this out. We bond 3 interfaces on another host in the same way and fire up iperf3 with 2 connections.

admin@10.2.3.3:~ iperf3 -c 10.2.3.4 -P 2
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-4.81   sec   272 MBytes   474 Mbits/sec    0 sender
[  4]   0.00-4.81   sec  0.00 Bytes  0.00 bits/sec        receiver
[  6]   0.00-4.81   sec   272 MBytes   473 Mbits/sec    0 sender
[  6]   0.00-4.81   sec  0.00 Bytes  0.00 bits/sec        receiver
[SUM]   0.00-4.81   sec   543 MBytes   947 Mbits/sec    0 sender
[SUM]   0.00-4.81   sec  0.00 Bytes  0.00 bits/sec        receiver

Well, that's a bit anti-climactic! Didn't we bond the interfaces, so shouldn't we be able to push above 1Gbps, if not the entire 3Gbps? Not quite.
By default, LACP bonding in Linux uses a layer 2 hash of the MAC addresses to distribute traffic between the bonded interfaces - so traffic between a given pair of hosts flows over only one interface, which here is a 1G interface. We need a way to change this. In our case, we need traffic between the same two hosts to be spread across multiple interfaces. So we tell the kernel just that:

admin@10.2.3.4:~# cat /sys/class/net/bond0/bonding/xmit_hash_policy
layer2 0
admin@10.2.3.4:~# echo 1 > /sys/class/net/bond0/bonding/xmit_hash_policy
admin@10.2.3.4:~# cat /sys/class/net/bond0/bonding/xmit_hash_policy
layer3+4 1

Here is what the kernel documentation says:

layer3+4
This policy uses upper layer protocol information, when available, to generate the hash. This allows for traffic to a particular network peer to span multiple slaves, although a single connection will not span multiple slaves.

One thing to note is that this mode is not fully 802.3ad compliant - so packets may arrive out of order. But for our use-case that is fine, since the intended application is not too sensitive to out-of-order arrivals.
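As an aside, if you'd rather bake the hash policy in at bond creation time instead of poking sysfs afterwards, iproute2 accepts it as a bond parameter - a sketch:

# create the bond with LACP mode and the layer3+4 hash policy in one go
ip link add bond0 type bond mode 802.3ad miimon 100 xmit_hash_policy layer3+4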

Let's try again:
admin@10.2.3.3:~ iperf3 -c 10.2.3.4 -P 2
[ ID] Interval           Transfer     Bandwidth       Retr
<snip>
[SUM]   0.00-7.35   sec  1.61 GBytes  1.89 Gbits/sec    0            sender
[SUM]   0.00-7.35   sec  0.00 Bytes  0.00 bits/sec      0          receiver

Now it's working. With that out of the way, we proceed to the Wireguard bit.

For installation and configuration, this Linode write-up is a very good starting point.

We will skip over the installation bits and jump straight to configuring it:

admin@10.2.3.4:~# cat /etc/wireguard/wg0.conf
[Interface]
PrivateKey = <local_priv_key>
ListenPort = 5000
Address = 10.255.254.2/28

[Peer]
PublicKey = <peer's pub key>
AllowedIPs = 10.255.254.0/28,10.5.6.0/24
PersistentKeepalive = 10

Here 10.5.6.0/24 is site B's LAN subnet, as indicated in the network diagram.
The addresses used for the wireguard interfaces don't really matter, as long as they don't collide with existing addresses on the servers.
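The key pairs themselves come from the standard wg tooling (as in the Wireguard quickstart):

umask 077
wg genkey | tee privatekey | wg pubkey > publickey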

Similarly on Site B:

[Interface]
PrivateKey = <local_priv_key>
ListenPort = 5000
Address = 10.255.254.1/28

[Peer]
PublicKey = <peer's pub key>
AllowedIPs = 10.255.254.0/28,10.2.3.0/24
Endpoint = 45.67.89.1:5000
PersistentKeepalive = 10

(45.67.89.1 is a placeholder for site A's public IP - the endpoint site B dials out to)
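To bring a tunnel up, wg-quick does the heavy lifting - either as a one-off, or via its systemd unit so it survives reboots:

wg-quick up wg0
# or, to have it come up on boot as well:
systemctl enable --now wg-quick@wg0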


Once the tunnels are up, here is what it looks like on Site A:

admin@10.2.3.4:~# wg
interface: wg0
  public key: <scrubbed>
  private key: (hidden)
  listening port: 5000
peer: <scrubbed>
  endpoint: <scrubbed>:5000
  allowed ips: 10.255.254.0/28, 10.5.6.0/24
  latest handshake: 1 minute, 36 seconds ago
  transfer: 25.16 GiB received, 13.37 GiB sent
  persistent keepalive: every 10 seconds
Now let's add routes on the servers at both sites, so that traffic destined for the other site goes through the local wg peer.

On a site A server that wants to reach site B, it would look like this:

admin@10.2.3.6:~# ip route add 10.5.6.0/24 via 10.2.3.4
Similarly add routes on site B as well.
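On a site B server it is the mirror image - something like this (10.5.6.8 is just a hypothetical host on that LAN):

admin@10.5.6.8:~# ip route add 10.2.3.0/24 via 10.5.6.7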
Now, onward to iperf3 to test our shiny new tunnel.

From site A
admin@10.2.3.6:~# iperf3 -c 10.5.6.7 -P 10
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd <snip>
[SUM]   0.00-10.00  sec  1.04 GBytes   893 Mbits/sec  432    sender
[SUM]   0.00-10.00  sec  1.02 GBytes   874 Mbits/sec        receiver
Close to 1G line rate (as close as we can expect for encrypted traffic) - but it is still below 1Gbps. How do we push this beyond 1Gbps?

Enter ECMP - Equal Cost Multi-path Routing. Linux has had this ability for a long time though it is not very widely used. The folks at Cumulus Networks have an excellent write up here if you're curious about the evolution of ECMP in Linux. 
The rough idea is that if we have multiple paths each capable of pushing 1Gbps, then we can distribute and route the packets over those paths (which will be wg tunnels in our case). But first, we need the multiple paths.

We simply add one more config at /etc/wireguard/wg1.conf - similar to wg0 - except we increment the port on both sides, so that the LACP bonding hash on side A picks a different interface after hashing on the port number. The port change is also needed because wireguard listens on all interfaces by default and there is no way to change this - so changing only the IP won't work.

Well, once we add the config, let's try to bring up the wg1 tunnel:

admin@10.2.3.4:~# cat /etc/wireguard/wg1.conf
[Interface]
PrivateKey = <local_priv_key>
ListenPort = 5001
Address = 10.255.253.2/28

[Peer]
PublicKey = <peer's pub key>
AllowedIPs = 10.255.253.0/28,10.5.6.0/24
PersistentKeepalive = 10

Note the change in the Address parameter under [Interface] - we do this to avoid a collision with the earlier wg0 tunnel. However, it fails to come up:
admin@10.2.3.4:~# systemctl status wg-quick@wg1
<snip>
: [#] ip link add wg1 type wireguard
: [#] wg setconf wg1 /dev/fd/63
: [#] ip -4 address add 10.255.253.2/28 dev wg1
: [#] ip link set mtu 1420 up dev wg1
: [#] ip -4 route add 10.5.6.0/24 dev wg1
: RTNETLINK answers: File exists
<snip>
One more road-block. Since wg0 already exists and has already inserted a route for site B into the routing table, the second tunnel fails to come up. Not to fret. Wireguard provides a neat little parameter called Table. Set it to off, and Wireguard will not add routes at all. Neat, huh?

admin@10.2.3.4:~# cat /etc/wireguard/wg1.conf
[Interface]
PrivateKey = <local_priv_key>
ListenPort = 5001
Address = 10.255.253.2/28
Table = off
Now, onwards to the pièce de résistance - ECMP. We add an ECMP route on the site A wg peer as below:

admin@10.2.3.4:~# ip route add 10.5.6.0/24 nexthop dev wg0 weight 1 nexthop dev wg1 weight 1
Similarly on the other side. Now iperf3:

admin@10.2.3.6:~# iperf3 -c 10.5.6.7 -P 10
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
<snip>
[SUM]   0.00-10.00  sec  1.89 GBytes  1.62 Gbits/sec  119501             sender
[SUM]   0.00-10.00  sec  1.82 GBytes  1.56 Gbits/sec                  receiver
Voila!
Now we can add one more Wireguard path and scale it up even more.
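A sketch of what that would look like, assuming a third tunnel wg2 (on port 5002, configured the same way as wg1):

ip route replace 10.5.6.0/24 \
    nexthop dev wg0 weight 1 \
    nexthop dev wg1 weight 1 \
    nexthop dev wg2 weight 1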

Feel free to reach out at muthu dot raj at outlook dot com for feedback or questions.

Thanks @santhikrishna for proof-reading this.

References:
1. Wireguard quickstart - https://www.wireguard.com/quickstart/
2. Cumulus Networks' write-up on ECMP in Linux - https://cumulusnetworks.com/blog/celebrating-ecmp-part-one/

Tuesday, 4 February 2020

A Tale of Apache and Two Congestion Control Algorithms in Linux (With a guest appearance by Openstack and cURL)


Introduction:
The shortest intro would be these two sets of cURL timings - one for a bare metal server and one for the Openstack VM, both running the same application, for a specific test URL.


Problem:


dnslookup: 0.109 | connect: 0.359 | appconnect: 0.000 | pretransfer: 0.359 | starttransfer: 0.624 | total: 6.739 | size: 710012
dnslookup: 0.016 | connect: 0.281 | appconnect: 0.000 | pretransfer: 0.281 | starttransfer: 0.531 | total: 6.958 | size: 710277
dnslookup: 0.000 | connect: 0.265 | appconnect: 0.000 | pretransfer: 0.265 | starttransfer: 0.530 | total: 5.351 | size: 709165

No problem:

dnslookup: 0.000 | connect: 0.265 | appconnect: 0.000 | pretransfer: 0.265 | starttransfer: 0.546 | total: 2.168 | size: 705631
dnslookup: 0.015 | connect: 0.281 | appconnect: 0.000 | pretransfer: 0.281 | starttransfer: 0.561 | total: 2.402 | size: 705354
dnslookup: 0.000 | connect: 0.265 | appconnect: 0.000 | pretransfer: 0.265 | starttransfer: 0.546 | total: 2.434 | size: 705485
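For reference, timings in that shape come from curl's -w option - a minimal sketch, with a placeholder URL:

curl -s -o /dev/null \
  -w 'dnslookup: %{time_namelookup} | connect: %{time_connect} | appconnect: %{time_appconnect} | pretransfer: %{time_pretransfer} | starttransfer: %{time_starttransfer} | total: %{time_total} | size: %{size_download}\n' \
  http://test.example/page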


For more details, read on.



Little background - 

The above difference in cURL times is between a KVM VM on Openstack and a bare metal server. At work we were contemplating moving some workloads to Openstack and increasing the density of our colo DC presence. However, the delay in response for the above test done through cURL meant we could not go ahead with adding this VM to production. Since I was a major proponent of Openstack inside the org, it fell upon me to demystify this bit.

ACT - I

Now, back to the main story. I had managed to obtain a similar new generation server with lots of cores, close to a terabyte of memory and about 12TB of storage. Here I had to carve out comparable VMs to convince the Project Manager that he could get a similar cost advantage in the DC, while also improving the density of our DC presence. The VM was launched with a config similar to the bare metal they were serving out of. It ran a simple Apache server with PHP and had a fancy-schmancy SR-IOV NIC, because I wanted to squeeze out every little bit of performance I could to bolster my case.

ACT - II

The team completed their sanity tests on the new server, but were tickled by one oddity - one of the tests, which essentially returned a lot of information in a simple HTML page, took thrice as much time on the new VM. While I independently started on the list mentioned in ACT III, the team's SRE looked at it and figured that Apache spent just as much time processing the request in both cases; the times started diverging only during the actual transfer. He even found the exact PHP function call that wrote the data back and figured that was the only slow part. (I had eliminated disk and CPU by this time.) But why was it slow inside the VM and not on the bare metal? This was a real head-scratcher.

Additionally, the team had reported that the test was fine in the browser (same time it took in the bare metal server), but cURL showed the delays they were concerned about.

ACT - III

Now, I began listing out the possible causes.

* Network
* CPU - some weird syscall that was taking a lot of time inside the VM / clock issue
* Disk / IO issue - fsync inside KVM with LVM storage is hundreds of times slower than bare metal  (this was before I had been informed that the request processing time was the same in both cases)

Let's talk about CPU first. The first thing that struck me was that the test - basically a page served by Apache - had a parameter in the URL called time. AHA, this must be the little bugger, I thought. Why? Because gettimeofday/clock_gettime syscalls were reportedly ~77% slower in AWS - the Packagecloud blog post (here) is a very good write-up of the issue. I enthusiastically ran the little program given in that page, congratulating my memory. Alas, that led me nowhere. I was not affected by this little implementation detail. The clock source in use was kvm-clock and the calls did not take inordinately longer than on the bare metal. I verified this with a tiny program of my own.
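If you want to rule this out on your own setup: the clocksource in use is exposed in sysfs, and a crude timing loop gives a rough per-call cost (this is just a stand-in for the tiny program mentioned above):

# which clocksource is the kernel using? (kvm-clock, tsc, xen, ...)
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
# rough per-call cost of fetching the time a million times
python3 -c 'import time; n=10**6; t0=time.perf_counter(); [time.time() for _ in range(n)]; print((time.perf_counter()-t0)/n*1e9, "ns per call")'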

The next thing was to see if the disk was causing the issue. Note that even now I didn't really pay attention to the fact that Apache finished processing the request in basically the same time. After a bit of work, I confirmed that the disk was not touched at all while processing the request. I was growing desperate because, well, what else was there that could be screwing things up?

Interlude - Red herrings

Onwards to the network then. I almost dismissed the network angle because, well, I had given the VM an SR-IOV NIC, so it did not have the overhead of going through the host's network stack. Playing around, I hit the first red herring.
I had found that the test URL, when hit by cURL with default options, resulted in a much larger (uncompressed) page being transferred. Brainwave! Pass the gzip header to cURL: -H 'Accept-Encoding: gzip' and voila, there was no issue at all. It took the same time in the browser and in curl, on both the VM and the bare metal. Alas, I rejoiced too soon.

The kind Sr. Engineer on the team pointed out that the same non-compressed page fetched via cURL took less than half the time on the bare metal. Now the issue was better qualified - the delay was seen only when the response was large (uncompressed).

The second red herring was, again, the network. I had been given the URL to test and assumed that the page was served over a domain that resolved to a public IP. Wait a minute, I thought - what path does the VM take to the Internet? Sure enough, it was taking a different path and went out through a different, smaller link. (It is possible for traffic incoming to a public IP and the outgoing path for that same public IP to be different.) Here, I declared, was the issue. I changed the path by altering routes on the host and expected the issue to disappear. Nope, it was still there - because the server was under testing, the URL actually resolved to a private IP, and the traffic flowed over an IPsec tunnel from the office (from where we were testing) to the DC.

I was growing impatient by this time.

ACT -IV - Final Act

At this point, I fired up trusty old Wireshark after a packet capture with tcpdump.

Here is what I saw:



Leftmost is bare metal, right is the VM. X-Axis is 1200000 for bare metal, and about 240000 for the VM.

Now there is something, at last.
I did not get a chance to look at this further, due to other tasks from my own team. When I got back, I fired up sysctl. 

sysctl -a | grep tcp

net.ipv4.tcp_congestion_control caught my eye. It was bbr on the bare metal, but cubic on the VM. Well, that's weird. I fired up Google. I knew one of these was better than the other, but how much better, and could it really explain the difference I was seeing?


Figure 3 was all I needed.

I went back to the VM, changed the congestion algo and reran the cURL loop.
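For reference, the switch itself is a one-liner - a sketch, assuming the tcp_bbr module is available (on older kernels BBR also wants fq as the default qdisc):

sysctl net.ipv4.tcp_available_congestion_control
modprobe tcp_bbr
sysctl -w net.core.default_qdisc=fq
sysctl -w net.ipv4.tcp_congestion_control=bbr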


VM:
dnslookup: 0.002588 | connect: 0.254602 | appconnect: 0.000000 | pretransfer: 0.254696 | starttransfer: 0.510628 | total: 2.038362 | size: 775063
dnslookup: 0.002340 | connect: 0.255680 | appconnect: 0.000000 | pretransfer: 0.255761 | starttransfer: 0.511749 | total: 2.022469 | size: 719422
dnslookup: 0.008388 | connect: 0.260589 | appconnect: 0.000000 | pretransfer: 0.260655 | starttransfer: 0.516559 | total: 2.003774 | size: 719531

Bare metal:

dnslookup: 0.002593 | connect: 0.254832 | appconnect: 0.000000 | pretransfer: 0.254903 | starttransfer: 0.514009 | total: 2.145449 | size: 773954
dnslookup: 0.002504 | connect: 0.255856 | appconnect: 0.000000 | pretransfer: 0.256012 | starttransfer: 0.514497 | total: 2.071955 | size: 718803
dnslookup: 0.002650 | connect: 0.255524 | appconnect: 0.000000 | pretransfer: 0.255754 | starttransfer: 0.515312 | total: 2.066484 | size: 719941

Yay!!!

This is what the packet capture looked like - Left side is the VM and right side is the bare metal.




Now, why did this occur? And what is the crucial difference that accounts for it?

"Most algorithms will continue to ramp up until they experience a dropped packet; BBR, instead, watches the bandwidth measurement described above. In particular, it looks at the actual delivered bandwidth for the last three round-trip times to see if it changes. Once the bandwidth stops rising, BBR concludes that it has found the effective bandwidth of the connection and can stop ramping up; this has a good chance of happening well before packet loss would begin."
So now all that was left was to establish that packet loss had indeed caused this, which would close the case. I looked at the packet captures - there were retransmissions in both.
Impatient, I fired up a ping with a 0.01 second interval. There it was, 9.8% packet loss, mocking me. This also explained why I could not reproduce the issue within the colo or from the other colo - both had zero packet loss.
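Something along these lines - the target here is a placeholder, and sub-second ping intervals typically need root:

# 2000 pings, 10ms apart; the loss percentage is in the summary at the end
ping -i 0.01 -c 2000 <server_ip>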

Well, there it was. An almost anti-climactic end to the adventure.
BBR FTW!

Sunday, 2 February 2020

Review: This Is Going to Hurt: Secret Diaries of a Junior Doctor

This Is Going to Hurt: Secret Diaries of a Junior Doctor by Adam Kay
My rating: 5 of 5 stars


Thanks for the recommendation @Gokila!

This is a brilliant, light-hearted read.
However, most of the entries in this Diary will sound strange to the Indian reader. Surely Doctors are not this humble, are they? And what is this blasphemy - a patient conversing on equal terms with the doctor? But the other side is not so pretty either.

Here in India, we have a different kind of patient-doctor relationship than the rest of the world. We simultaneously adore them and loathe them. A significant portion of the population wants their offspring to try their hand at being doctors - but an equally large portion thinks doctors conspire to put them through painful and expensive procedures to mint money. (Or worse, to "train" at their expense.)

We have all read about doctors getting attacked by mobs because someone passed away and the crowd thought it was because the doctors did not do enough. The Prime Minister of the country recently alleged that doctors were being bribed by big pharma companies with gadgets, trips and even women. Clearly, India has miles to go in both directions. We need doctors who would deign to step off their pedestal and treat their patients as equal beings (not in the context of medicine, just to be clear). And we need patients who are capable of seeing the human underneath the white coat - a human with a very-much-human existence to live through, one that puts a strain on them that most of us do not know.

I guess this ended up being more of a rant than a book review, but then this book hits all the right notes so well that you take it for granted - of course a collection of diary entries by a junior doctor in Great Britain is supposed to be this funny, witty and articulate. I guess it goes to the credit of the author that it feels so.


View all my reviews
