Tuesday, 4 February 2020

A Tale of Apache and Two Congestion Control Algorithms in Linux (With a guest appearance by Openstack and cURL)


Introduction:
The shortest intro would be these two sets of cURL timings - one from a bare metal server and one from the Openstack VM - both running the same application, hit at the same test URL.


Problem:


dnslookup: 0.109 | connect: 0.359 | appconnect: 0.000 | pretransfer: 0.359 | starttransfer: 0.624 | total: 6.739 | size: 710012
dnslookup: 0.016 | connect: 0.281 | appconnect: 0.000 | pretransfer: 0.281 | starttransfer: 0.531 | total: 6.958 | size: 710277
dnslookup: 0.000 | connect: 0.265 | appconnect: 0.000 | pretransfer: 0.265 | starttransfer: 0.530 | total: 5.351 | size: 709165

No problem:

dnslookup: 0.000 | connect: 0.265 | appconnect: 0.000 | pretransfer: 0.265 | starttransfer: 0.546 | total: 2.168 | size: 705631
dnslookup: 0.015 | connect: 0.281 | appconnect: 0.000 | pretransfer: 0.281 | starttransfer: 0.561 | total: 2.402 | size: 705354
dnslookup: 0.000 | connect: 0.265 | appconnect: 0.000 | pretransfer: 0.265 | starttransfer: 0.546 | total: 2.434 | size: 705485


For more details, read on.



A little background -

The above difference in cURL times is between a VM on KVM under Openstack and a bare metal server. At work we were contemplating moving to Openstack for some workloads, to increase the density of our colo DC presence. However, the delayed response in the above cURL test meant we could not go ahead with adding this VM to production. Since I was a major proponent of Openstack inside the org, it fell upon me to demystify this bit.
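For reference, timings like the ones above come from cURL's -w write-out option. A minimal sketch of the kind of loop involved (the URL is a placeholder):

# Print per-phase timings for a few runs of the test URL
for i in 1 2 3; do
    curl -so /dev/null http://test.example.com/page \
         -w 'dnslookup: %{time_namelookup} | connect: %{time_connect} | appconnect: %{time_appconnect} | pretransfer: %{time_pretransfer} | starttransfer: %{time_starttransfer} | total: %{time_total} | size: %{size_download}\n'
done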

ACT - I

Now, back to the main story. I had managed to obtain a similar new-generation server with lots of cores, close to a terabyte of memory and about 12TB of storage. On it, I had to carve out similar VMs to convince the Project Manager that he could get a similar cost advantage in the DC, while also improving the density of our DC presence. The VM was launched with a config similar to the bare metal they were serving out of. It was to run a simple Apache server with PHP, and it had the fancy-schmancy SRIOV NIC because I wanted to squeeze out every little bit of performance I could, to bolster my case.

ACT - II

The team completed their sanity tests on the new server, but were tickled by one oddity - one of the tests, which essentially returned a lot of information in a simple HTML page, took thrice as much time on the new VM. While I independently started on the list mentioned in ACT III, the team's SRE looked at it and figured that Apache spent just as much time processing the request; the times started diverging only during the actual transfer. He even found the exact PHP function call that wrote the data back and figured that was the only slow one. (I had eliminated disk and CPU by this time.) But why was it slow inside the VM, but not on the bare metal? This was a real head-scratcher.

Additionally, the team had reported that the test was fine in the browser (it took the same time as on the bare metal server), but cURL showed the delays they were concerned about.

ACT - III

Now, I began listing out the possible causes.

* Network
* CPU - some weird syscall that was taking a lot of time inside the VM / clock issue
* Disk / IO issue - fsync inside KVM with LVM storage is hundreds of times slower than bare metal (this was before I had been informed that the request processing time was the same in both cases)

Let's talk about CPU first. The first thing that struck me was that the test - basically a page served by Apache - had a parameter in the URL called time. AHA, this must be the little bugger, I thought. Why? Basically, gettimeofday/clock_gettime syscalls were 77% slower in AWS - the packagecloud blog post (here) is a very good write-up of the issue. I enthusiastically ran the little program given in that page, congratulating my memory. Alas, that led me nowhere. I was not affected by this little implementation detail. The clock source in use was kvm-clock, and it did not take an inordinate amount of time longer than the bare metal. I verified this with a tiny program of my own.
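For the record, a minimal sketch of that kind of check - which clocksource is active, and whether time calls are falling back from the vDSO to real syscalls (the PHP one-liner is just an illustrative load, not the actual test program):

# kvm-clock supports the vDSO fast path; xen/acpi_pm clocksources may not
cat /sys/devices/system/clocksource/clocksource0/current_clocksource

# Time calls served via the vDSO never reach the kernel, so they should
# barely appear here; a flood of clock_gettime/gettimeofday syscalls would
# indicate the slow fallback path
strace -c -e trace=clock_gettime,gettimeofday \
    php -r 'for ($i = 0; $i < 100000; $i++) { microtime(true); }'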

The next thing was to see if the disk was causing the issue. Note that even now I hadn't really paid attention to the fact that Apache finished processing the request in basically the same time. After a bit of work, I confirmed that the disk was not touched at all while processing the request. I was growing desperate because, well, what else was there that could be screwing things up?
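A quick way to rule the disk out is to watch block-device activity while replaying the request. A sketch:

# Extended per-device stats every second; run the cURL test in parallel and
# look for any read/write activity attributable to the request
iostat -xz 1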

Interlude - Red herrings

Onwards to network then. I almost dismissed the network bit because, well, I had given the VM a SRIOV NIC, so it did not have the overhead of going through the host's network stack. Playing around, I hit the first red herring.
I had found that the test URL, when hit by cURL, by default resulted in a much larger page being transferred (browsers ask for compression by default; cURL does not). Brainwave! Pass the gzip header to cURL: -H 'Accept-Encoding: gzip' - and voila, there was no issue at all. It took the same time in the browser and in cURL, on both the VM and the bare metal. Alas, I rejoiced too soon.

The kind Sr. Engineer in the team pointed out that the same non-compressed page via cURL took less than half the time on the bare metal. Now the issue was better qualified - the delay was seen only when the output size was large (uncompressed).

The second red herring was, again, the network. I had been given the URL to test and assumed that the page was served over a domain that resolved to a public IP. Wait a minute, I thought - what path does the VM take to the Internet? Sure enough, it was taking a different path, going out through a different, smaller link. (It is possible for the incoming path to a public IP and the outgoing path from it to be different.) Here, I declared, this was the issue. I changed the path by altering routes on the host and expected the issue to disappear. Nope, it was still there - because, since the server was under testing, the domain actually resolved to a private IP, and the traffic flowed over an IPsec tunnel from the office (from where we were testing) to the DC. The Internet path never came into play.
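For what it's worth, a sketch of how one can confirm the egress path for a given destination (the IP is a placeholder):

# Which route/interface does the kernel pick for this destination?
ip route get 10.20.30.40

# And what does the actual path look like, hop by hop?
traceroute 10.20.30.40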

I was growing impatient by this time.

ACT - IV - Final Act

At this point, I fired up trusty old Wireshark on a packet capture taken with tcpdump.
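Something along these lines, run on both machines (the interface and port are assumptions):

# Capture the test traffic for offline analysis in Wireshark
tcpdump -i eth0 -w vm.pcap 'tcp port 80'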

Here is what I saw:



On the left is bare metal, on the right the VM. The X-axis runs to 1200000 for bare metal, but only to about 240000 for the VM.

Now there is something, at last.
I did not get a chance to look at this further due to other tasks from my own team. When I got back, I fired up sysctl.

sysctl -a | grep tcp

net.ipv4.tcp_congestion_control caught my eye. It was bbr on the bare metal, but cubic on the VM. Well, that's weird. I fired up Google. I knew one of these was better than the other, but how much better? And could it really explain the difference I was seeing?


Figure 3 was all I needed.

I went back to the VM, changed the congestion control algorithm, and reran the cURL loop.
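A minimal sketch of the switch, assuming a kernel that ships the tcp_bbr module:

# Load BBR and make it the default congestion control
modprobe tcp_bbr
sysctl -w net.ipv4.tcp_congestion_control=bbr

# Kernels before 4.13 also need the fq qdisc for BBR's pacing
sysctl -w net.core.default_qdisc=fq

# Persist across reboots
printf 'net.core.default_qdisc = fq\nnet.ipv4.tcp_congestion_control = bbr\n' \
    > /etc/sysctl.d/99-bbr.conf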


VM:
dnslookup: 0.002588 | connect: 0.254602 | appconnect: 0.000000 | pretransfer: 0.254696 | starttransfer: 0.510628 | total: 2.038362 | size: 775063
dnslookup: 0.002340 | connect: 0.255680 | appconnect: 0.000000 | pretransfer: 0.255761 | starttransfer: 0.511749 | total: 2.022469 | size: 719422
dnslookup: 0.008388 | connect: 0.260589 | appconnect: 0.000000 | pretransfer: 0.260655 | starttransfer: 0.516559 | total: 2.003774 | size: 719531

Bare metal:

dnslookup: 0.002593 | connect: 0.254832 | appconnect: 0.000000 | pretransfer: 0.254903 | starttransfer: 0.514009 | total: 2.145449 | size: 773954
dnslookup: 0.002504 | connect: 0.255856 | appconnect: 0.000000 | pretransfer: 0.256012 | starttransfer: 0.514497 | total: 2.071955 | size: 718803
dnslookup: 0.002650 | connect: 0.255524 | appconnect: 0.000000 | pretransfer: 0.255754 | starttransfer: 0.515312 | total: 2.066484 | size: 719941

Yay!!!

This is what the packet capture looked like - Left side is the VM and right side is the bare metal.




Now, why did this occur? And what is the crucial difference that accounts for it?

"Most algorithms will continue to ramp up until they experience a dropped packet; BBR, instead, watches the bandwidth measurement described above. In particular, it looks at the actual delivered bandwidth for the last three round-trip times to see if it changes. Once the bandwidth stops rising, BBR concludes that it has found the effective bandwidth of the connection and can stop ramping up; this has a good chance of happening well before packet loss would begin."
So now all that was left was to establish that packet loss had indeed caused this, which would close the case. I looked at the packet captures - there were retransmissions in both.
Impatient, I fired up a ping with a 0.01-second interval. There it was: 9.8% packet loss, mocking me. This also explained why I could not reproduce the issue within the colo or from the other colo - both had zero packet loss.
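Something like this (the destination is a placeholder; sub-0.2s intervals need root):

# 1000 pings, 10ms apart - enough to surface loss on the office-to-DC path
sudo ping -i 0.01 -c 1000 10.20.30.40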

Well, there it was. An almost anti-climactic end to the adventure.
BBR FTW!

Sunday, 2 February 2020

Review: This Is Going to Hurt: Secret Diaries of a Junior Doctor

This Is Going to Hurt: Secret Diaries of a Junior Doctor by Adam Kay
My rating: 5 of 5 stars


Thanks for the recommendation @Gokila!

This is a brilliant, light-hearted read.
However, most of the entries in this diary will sound strange to the Indian reader. Surely doctors are not this humble, are they? And what is this blasphemy - a patient conversing on equal terms with a doctor? But the other side is not so pretty either.

Here in India, we have a different kind of patient-doctor relationship than the rest of the world. We simultaneously adore doctors and loathe them. A significant portion of the population wants their offspring to try their hand at being doctors - but an equally large portion thinks doctors conspire to put them through painful and expensive procedures to mint money. (Or worse, to "train" at their expense.)

We have all read about doctors getting attacked by mobs because someone passed away and the crowd thought it was because the doctors did not do enough. The Prime Minister of the country recently alleged that doctors were being bribed by big pharma companies with gadgets, trips and even women. Clearly, India has miles to go in both directions. We need doctors who would deign to step off their pedestal and treat their patients as equal beings (not in the context of medicine, just to be clear). And we need patients who are capable of seeing the human underneath the white coat - a human with a very-much-human existence to live through, which puts a strain on them that most of us do not know.

I guess this ended up being more of a rant than a book review, but then this book hits all the right notes so well that you take it for granted. Of course a collection of diary entries by a junior doctor in Great Britain is supposed to be this funny, witty and articulate. I guess it goes to the author's credit that one feels so.



Thursday, 25 April 2019

Virtualization in DC vs AWS

Without getting into the larger debate of private cloud vs public cloud for the organization as a whole, this document tries to give the numbers that let us make a fair comparison in terms of performance - and performance alone - between AWS and a private cloud built on Openstack.
The initial benchmarking and preparation of these numbers stemmed from a proposed DC migration to a west coast co-located DC at my current employer. The migration did not happen, but it gave us the opportunity to study the performance impact of having our own cloud.
One of the gripes that I've had about our colo infra is that we never seem to have gotten grade-A processors. Almost all of our servers were hand-me-downs or mid-level Intel Xeons, sometimes with absurdly low clock speeds. The storage side wasn't very brilliant either.
This was solved when we obtained a test bed from Dell with the latest and greatest hardware in their lab that we could setup and run benchmarks on.
The hardware in the test environment consisted of:
PowerEdge R740XD (Quantity: 1)
 (2) Intel Gold 6136 3.0GHz 12C processors
 384 GB RAM (12x 32GB configuration)
 Dell PERC H740P controller
 (6) 800GB SAS SSDs (RAID10 configuration)
 Mellanox 2 port 25GbE ConnectX-4 adapter
 Dell S5048 25 GbE 48-port switch
 Dell S3048 1GbE management switch (out-of-band management)
Note that this was not the top-of-the-line Platinum processors coupled with dedicated offload hardware (Nitro, in AWS parlance) that AWS had. But it represents a decent compromise between cost and performance, and more importantly, it was something we could hope to replicate in our DC without the company having to trade an arm and a leg.
We decided to pick the latest and greatest in the AWS arsenal - a c5.4xlarge with a 100G EBS volume - to pit against our VM, which had the same configuration for CPU, memory and disk.
The VM was set up with CPU passthrough mode - meaning all the flags of the host CPU were exposed to the VM.
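A sketch of how one might verify that, assuming a libvirt-managed guest (the domain name is a placeholder):

# The domain XML should carry mode='host-passthrough' on its <cpu> element
virsh dumpxml test-vm | grep -A1 "<cpu"

# And inside the guest, the host's full CPU flag set should be visible
grep -m1 flags /proc/cpuinfo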
What to compare
Once we'd gotten the hardware, we needed to decide what exactly we wanted to compare with AWS. We already knew we had no hope of competing with them on storage. We also knew we could match AWS on network performance because we had gotten SRIOV working - which gives near-bare-metal performance to VM NICs. We ultimately decided on these:
  • Raw CPU performance - Compute
  • Memory performance - Compute
  • Mixed Compute performance - Compilation tests
  • Raw Storage benchmarks
  • SQLite benchmarks - Real-world approximation for storage-intensive tasks
Care was taken to match OS, software and library versions as much as possible. We also accounted for the fact that ours was the only VM running on the host by dynamically disabling the extra cores on the host - meaning the only thing that would come into play is the virtualization overhead (or so we hoped).
The host CPU governor was left in PowerSave mode.
The Raw compute benchmarks were also later replicated on the host that was running 3 VMs of 16 cores each, with the tests running on all of them simultaneously.
Below we have the real numbers that we obtained out of these tests, and we also discuss what these numbers may imply. We are presenting a subset of all the benchmarks that we did.
The tests were run using the Phoronix test suite, and with direct commands in the case of OpenSSL.
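For the OpenSSL numbers, the direct invocations would look something like this (a sketch):

# Single-core RSA signing/verification throughput
openssl speed rsa2048

# All-core run; -multi forks one worker per core
openssl speed -multi "$(nproc)" rsa2048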

Raw CPU performance

Single core ssl signing
VM in Dell HW:  rsa 2048 bits - 1763.6 sign/s, 61026.7 verify/s
AWS c5.4xlarge: rsa 2048 bits - 1613.4 sign/s, 54149.8 verify/s

Considering a 2048-bit key, the VM is 8.9% faster.
Implications: 
Considering the CPU models that we see inside AWS instances, it is a bit surprising that our Gold CPU can be faster. It means that either 1) the AWS hypervisor doesn't allocate as many CPU shares to each vCPU as the underlying CPU can provide, or 2) AWS Platinum processors are decidedly high-core-count but lower-performance parts compared to generally available top-end Xeon chips.
This is where the ugly face of what AWS calls the Elastic Compute Unit (ECU) comes into play - what you get as a core is actually defined as x number of ECUs rather than as a core on the CPU directly.
Multi core ssl signing
VM in Dell HW:  rsa 2048 bits - sign 0.000036s, verify 0.000001s, 27984.2 sign/s, 959951.4 verify/s
AWS c5.4xlarge: rsa 2048 bits - sign 0.000077s, verify 0.000002s, 12994.5 sign/s, 437587.3 verify/s

Considering a 2048-bit key, the VM is 109% faster.
Implications:
Multi-core turbo scaling of commodity Xeon Golds is much better than in AWS, where it is probably uneven / turned off to help fit each core into the ECU definition - meaning the lion's share of the power saving / performance improvement from dynamic clock speeds goes to AWS.
One interesting tidbit to mention here is that the turbo scaling Intel advertises does not happen equally for all processor cores. The maximum turbo frequency drops roughly linearly as the number of active cores increases.
Check this graph out.

Memory

Redis benchmarking
This can be sort of a composite memory test with CPU being involved a fair bit.
VM in Dell HW:  Average: 1692634.75 requests per second
AWS c5.4xlarge: Average: 1638467.50 requests per second

22% faster
Implications:
We believe this is due to the spill-over effect of the AWS CPUs having lower raw compute performance.
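For reference, a standalone command that exercises a similar path (a sketch; the Phoronix profile wraps something comparable):

# Pipelined GET/SET throughput against a local redis-server
redis-benchmark -t set,get -n 1000000 -P 16 -q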

Composite Performance

Linux Kernel Compile test
VM in Dell HW:  Average: 60.84 seconds
AWS c5.4xlarge: Average: 99.57 seconds

63% faster
Implications:
This is again a CPU-intensive test, and the previous results percolate through.
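This one is straightforward to reproduce (a sketch; the profile name is as in current Phoronix versions):

# Times a kernel build across all available cores
phoronix-test-suite benchmark pts/build-linux-kernel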

Storage:

Before putting the numbers down here, let me state that storage is a complicated business. Raw storage performance is not what you get in the real world, and things like the OS cache, filesystem choice, fsync behaviour, RAID cache and QCOW optimizations make it hard to arrive at an apples-to-apples comparison.
So I am going to just paste the raw results of the entire test suite we ran for the Storage benchmarks below.
  1. AWS disk
  2. VM disk through QCOW
    Note: The links are down at the moment. I will fix them as soon as I get the chance.
I am more than happy to discuss the specifics of the storage tests over a chat / meet.

Database

Sqlite
We could not run the full PostgreSQL test as part of Phoronix, so we settled for SQLite due to the time crunch. Yes, we know it's not a real DB.
VM in Dell HW:  Average: 5.65 seconds (RAIDed SAS SSDs)
AWS c5.4xlarge: Average: 19.81 seconds

But these results are misleading. Here is the interesting thing:

Dell bare metal: Average: 64.3825 seconds
AWS c5.4xlarge:  Average: 19.81 seconds

About 3.25x slower
Further Discussions:
By default, fsync is performed after every transaction in the SQLite test. When measuring raw fsync performance inside our VM, we found that it can be up to 100 times slower per fsync call.
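A sketch of the kind of raw fsync measurement, using fio (the file path is arbitrary):

# 4k sequential writes with an fdatasync after every write - a rough proxy
# for SQLite's default commit behaviour
fio --name=fsync-test --rw=write --bs=4k --size=64m \
    --fdatasync=1 --filename=/var/tmp/fsync-test.dat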
Check out the image below:

Which means the ridiculously low numbers that we were seeing for the Dell VM are probably due to a QCOW optimization.
Making sense of this involves a deep dive into the default KVM virtio storage cache mode, the effects of the OS cache, and the effects of the RAID cache. Perhaps a separate write-up on this is in order.
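The first stop in that dive would be checking what cache mode the guest disk actually runs with (a sketch; the domain name is a placeholder):

# The disk's <driver> element shows the cache mode, e.g. cache='writeback'
virsh dumpxml test-vm | grep -B2 -A2 "driver name='qemu'"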
But one thing we know for sure is that AWS has figured out how to drastically improve storage performance, and we cannot hope to match it unless we pay a storage vendor for a proprietary solution.
Final words:
Raw performance is not the only thing, not even the first thing, that comes up as a consideration when deciding between private and public cloud. A cloud is much more than an infrastructure-as-a-service platform. However, when performance does come up, people usually lean towards the public cloud - this write-up hopes to clear things up (or muddy them further) in that respect.

Wednesday, 20 March 2019

Interesting numbers #1

Interesting numbers will be a section on my blog where I point out statistics that I find interesting or mildly amusing, or that provide a new, different perspective.



25%
Tamil Nadu is a key market for education loans in India, accounting for 25% of all education loans disbursed across the country - around Rs. 20,650 crores.

1.5 Million
No. of hectares of forest diverted for non-forest purposes under the Forest Conservation Act

1.8 Million
No. of tribals to be evicted under a recent Supreme Court order, under the pretext of protecting forests.

Thursday, 31 January 2019

On mentorship

Being a mentor is not an easy thing. There are a lot of things that a mentor has to do beyond imparting knowledge. To mentor is to mould and sculpt - to recognize that there is a permanence to the teaching, for better or worse. One of those things is choosing when to intervene as a mentor and fix the mess created by your student. My favorite description of the importance of this intervention is captured in Arthur Hailey's novel Airport. One of its pages has a story about an air traffic controller training a new recruit -

"George Wallace nodded and edged closer to the radarscope. He was in his mid-twenties, had been a trainee for almost two years; before that, he had served an enlistment in the U. S. Air Force. Wallace had already shown himself to have an alert, quick mind, plus the ability not to become rattled under tension. In one more week he would be a qualified controller, though for practical purposes he was fully trained now. Deliberately, Keith allowed the spacing between an American Airlines BAC-400 and a National 727 to become less, than it should be; he was ready to trasmit quick instructions if the closure became critical. George Wallace spotted the condition at once, and warned Keith, who corrected it. That kind of firsthand exercise was the only sure way the ability of a new controller could be gauged. Similarly, when a trainee was at the scope himself, and got into difficulties, he had to be given the chance to show resourcefulness and sort the situation out unaided. At such moments, the instructing controller was obliged to sit back, with clenched hands, and sweat. Someone had once described it as, "hanging on a brick wall by your fingernails." When to intervene or take over was a critical decision, not to be made too early or too late. If the instructor did take over, the trainee's confidence might be permanently undermined, and a potentially a good controller lost."

I have always had such mentors who knew when to step in. And to that, I owe them a lifetime of gratitude.

Wednesday, 26 December 2018

Acknowledging privileges

Privilege - a special advantage that some people enjoy over others. It is an elusive idea, hard to understand for a lot of people. Why is it so? Why do so many people carry on with their lives completely oblivious to the privileges they enjoy by virtue of their very existence? How can so many people miss something so obvious, something that pervades all of their life? Is it the extraordinary levels of insulation that people can enjoy in this society? Or is it merely an inability to think beyond their little ponds? Can people be pathologically incapable of recognising the privileges that they enjoy and that others are so cruelly denied?

It is not hard to find people like this. People who rant about having to pay taxes, or about the reservation system, or who in general think that society is unfair to them because they do not always get their way.

Don't get me wrong. People are well within their rights to question high taxes or nepotism or inordinate majoritarianism in cornering resources in the name of reservation. But a little acknowledgement might not be out of place. The transactional attitude to life, where everything is evaluated in individual terms of profit or loss, is simply wilful deafness to cries for empathy and basic human decency.

People need to realise that they are all cogs in a huge machinery called society, and that no matter how extraordinarily talented they are and no matter how isolated they think their actions are, others still contribute to their success / gain. It may be hard to see, but it is because certain sections of society, in so many subtle ways, sacrifice things that are valuable to them that I get to enjoy my privileges. I get to enjoy flexible timings at work, and I can jolly well proclaim that I deserve it because I line the pockets of my 'company' and that it is fair compensation for the talent I bring in. But is it really? Isn't the rest of society toiling away enabling me to enjoy this? I recognise that this is a slippery slope. Society is not toiling away for me in particular - people toil because that's what the masses do - in the absence of war or famine, people just work their asses off. But some people enjoy a disproportionate share of the fruit of society's labour. While a significant portion of the population needs to run on time or risk getting penalised, a small portion gets to strut around oblivious to the constraints of time. I get to enjoy food anytime I want because someone chooses to be away from the warmth of family and shelter to deliver my food. And it is not my case that they are not being compensated for it - they very well might be - but it is no one's case that these people are not missing out on some of the most basic things humans want to secure in their life. It is important to acknowledge the privilege I enjoy.

Yes, people get paid - but, as the Indian privileged class is fond of claiming, money isn't everything. It is important to acknowledge that certain people - people who work in the so-called white-collar industries - are able to enjoy their privileges only because society operates in a way that incentivises unequal rewards for equitable efforts, and to recognise that we are not unfeeling variables in the equation of a zero-sum game.

But why is this acknowledgement so important? What is so wrong in being oblivious to your privilege as long as it is not causing anybody any harm?


It is important because of the flippant attitude people have. People seem to think it is okay to behave in a certain manner or mete out a certain treatment because they can somehow morally justify it to themselves, usually in terms of money. 'I pay more in taxes than he earns / than the state gives me back.' Everyone must have heard a variation of this statement in their lifetime. What does it mean? It points to a deeply ingrained denial of the value that society adds to one's life. It is a particularly sociopathic justification of cruelty in terms of flimsy logic. It is a refusal to acknowledge the privilege that one enjoys - by way of residing in a peaceful country, enjoying a lawful state apparatus (for the most part, at least), and relying on society for everything from the most basic needs - from milk to manual scavengers on hire. Somehow everything is looked at through a lens of efficiency, and the humanity is scorched out, especially when it suits the person uttering these words.

There is a price that everyone pays for being in a society that adheres to / is compelled to adhere to certain norms that resemble order and civility - some pay more than others. Some have to make greater sacrifices than others to enjoy the same things - not because of their talent or ability, but seemingly because they did not possess a special advantage at a crucial juncture in their life. A privilege.

Sunday, 29 January 2017

Review: Justice: What's the Right Thing to Do?

Justice: What's the Right Thing to Do? by Michael J. Sandel
My rating: 5 of 5 stars

Ever felt that you are being meted out injustice due to reservations based on caste?
Ever felt that the shopkeepers who sold milk for 300 rupees a packet and water for 200 rupees a can should be punished?
Ever felt strongly for or against same-sex marriage? Euthanasia? Cannibalism?

If your answer to any of these questions is a yes, then you need to read this book. This book will take that question, rip it apart, then patch it up again, making you understand the various strands that held the question together in the first place.

One of those rare, must-read books.

