AWS throttles network bandwidth on cheaper instances

October 2nd, 2020 by Yuvraaj Kelkar

“Why is the git clone taking over three hours?”

We have a setup in which the build farm clones the latest source code once a week for each of ~30 git repositories. All but two are expected to complete within 10 minutes.
The “one clone per week” rule is part of a workflow that limits network spend, whether ingress or egress.

Last week, 8 or 9 of these clone tasks started at approximately the same time.
More than 3 hours later, every clone had failed with timeout errors, TLS handshake errors, or curl errors.
After multiple retries, poring over git logs, kernel logs, Stack Overflow, and even some non-standard appeals to higher powers, we had nothing more than a hunch that something was throttling ingress network traffic.

“Because you wanted a cheap instance for routing”

The build farm is made of multiple ephemeral worker nodes that are not allocated public IP addresses to reduce their attack surface.
They still need to clone from external git servers, so we use a “nat gateway instance” on AWS to route all of that traffic to and from the worker nodes.
In retrospect, the fact that this one instance sits in the path of every git clone should have been enough to identify it as the place where AWS would enforce network throttling.
The instance was originally created as a t3.nano to keep costs low, because it was expected to be always on, unlike the worker nodes that are terminated when idle.

AWS gives no clear explanation of the network bandwidth offered by each instance type. What is “Low” performance? What does “Moderate” even mean? There is no official explanation from Amazon.
Stack Overflow has an unofficial answer of course, and these days if there is an answer on Stack Overflow, it counts as an authoritative answer.

Quick fix

After experimenting a bit with instance types while starting all the git clones at the same time, we changed the nat-gateway instance type from t3.nano to t2.medium.
That seemed to allow all the git clones to succeed even when they were executed in parallel.

Problem solved. For now.

What did it cost?

Let’s be honest here: AWS is not cheap and never was.
Getting the most performance for the price AWS charges is what drives a lot of our innovation.
What we had on our hands was somewhat like the infamous Slashdot effect or Reddit hug of death, only it was entirely self-inflicted.
As always, the simplest fix was to throw more money at the problem: we just switched to a costlier instance type.
A t2.nano instance costs $0.0058/hour. A t2.medium costs $0.0464/hour. Over a 720-hour (30-day) month, that is an increase from $4.176/month to $33.408/month.
These numbers are inconsequential when taken on their own, but the new instance is 8 times as expensive as the old one.
If we’re forced to employ these workarounds across multiple AWS products and features, this sort of 8x price jump eventually adds up.
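
For anyone who wants to check the arithmetic, here is a small Python sketch. It assumes a 720-hour (30-day) month, which is what the monthly figures above imply, and uses only the hourly prices quoted in this post; the function name is made up for illustration.

```python
# Assumption: a 30-day month, which matches the monthly figures in the post.
HOURS_PER_MONTH = 24 * 30

def monthly_cost(hourly_rate_usd: float) -> float:
    """Cost in USD of keeping one instance running for a whole month."""
    return hourly_rate_usd * HOURS_PER_MONTH

nano = monthly_cost(0.0058)    # hourly price quoted for the nano instance
medium = monthly_cost(0.0464)  # hourly price quoted for the t2.medium

print(f"nano:   ${nano:.3f}/month")
print(f"medium: ${medium:.3f}/month")
print(f"ratio:  {medium / nano:.1f}x")
```

Prices vary by region and change over time, so treat the two constants as inputs to replace, not as current AWS pricing.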

So, while this might be fine as a short-term fix, in the longer term we will change the Crave.io product so that git repository clones are spread out over time and stay below the throttling threshold.
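
One simple way to spread weekly clones over time is to give each repository its own evenly spaced start offset within the week. The Python below is only a sketch of that idea, not how Crave.io implements it; the function name and the repository names are hypothetical.

```python
WEEK_SECONDS = 7 * 24 * 60 * 60

def clone_offsets(repos, window_seconds=WEEK_SECONDS):
    """Map each repo to a start offset (in seconds) inside the window.

    Offsets are evenly spaced, and repos are sorted first so that the
    assignment is the same on every run regardless of input order.
    """
    if not repos:
        return {}
    slot = window_seconds / len(repos)
    return {repo: int(i * slot) for i, repo in enumerate(sorted(repos))}

# Hypothetical repository names, for illustration only.
repos = [f"repo-{i:02d}" for i in range(30)]
offsets = clone_offsets(repos)

for repo in sorted(offsets)[:3]:
    print(repo, "starts", offsets[repo] / 3600, "hours into the week")
```

With 30 repositories the slots are 5.6 hours apart, which leaves plenty of room even for the clones that take longer than 10 minutes. A scheduler would still need to handle retries and clones that overrun their slot.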