Running RavenDB on Burstable Cloud Instances
One of the silent features of moving to the cloud is that it makes the relationship between performance and cost transparent. In your own data center, it's easy to lose track of the connection between the yearly budget and your application's performance. On the cloud, you can track it quite easily.
This immediately leads you to try to optimize costs and use the minimal amount of resources you can get away with. This is great if you believe in efficient software.
One of the results of this focus on price and performance has been the introduction of burstable cloud instances. These instances let you host machines that do not need their full resources all the time; there are many systems whose normal runtime cost is minimal, with only occasional spikes. In AWS you have the T series, and Azure has the B series. Given how often RavenDB is deployed on the cloud, it shouldn't surprise you that it is frequently run on such instances. The cost savings can be quite attractive, around 15% in most cases, and for small instances the savings can be even more impressive.
RavenDB runs quite nicely on a Raspberry Pi or a resource-starved container, and for many workloads, that makes sense. Looking at AWS in this case, consider the t3a.medium instance with 2 cores and 4 GB at $27.40 / month vs. a1.large (the smallest non-burstable instance) with the same spec (but an ARM machine) at $37.20 per month. For that matter, a t3a.small with 2 cores and 2 GB of memory is $13.70. As you can see, the cost savings add up, and it makes a lot of sense to use the most cost-effective solution.
Enter the problem with burstable instances: you have two options when you need more resources. Either you get throttled, reducing the amount of CPU you can use, or you get charged for the extra CPU power you needed.
Our own production systems, which run all of our sites and backend systems, run on a cluster of 3 t3.large instances. Using RavenDB, the cost savings are significant. But what happens when you go above the limit? In our production systems, we have never had an instance where RavenDB used more than the provided burstable performance. It helps that the burstable machines allow us to accrue CPU time when we aren't using it, but overall it means that we are doing a good job of handling requests efficiently. Here are some metrics from one of the nodes in the cluster during a somewhat slow period (the range is usually 20 – 200 requests / sec).
So we are in good shape in terms of latency and resource utilization. That's great.
However, the immediate response to seeing that we aren't hitting the system limits is… to reduce the machine size again, to pay even less. One of the more interesting choices you can make with burstable instances is to ask not to go over the limit: if your system is using too much CPU, it is simply taken away until the work is done. Some workloads are just fine with this because there is no urgency. Other workloads, such as servicing requests, fare much worse under this sort of setup.
Let's consider the worst case scenario from our perspective. We have a user running the cheapest possible instance, a t3a.nano with 2 cores and 512 MB of RAM, costing just $3.40 a month. The caveat with this instance is that you have just 6 CPU credits per hour to your name. A CPU credit is basically 100% of one CPU for one minute. Another way of looking at it is that a t3a.nano instance has a budget of 360 CPU seconds per hour. If it uses more than that, it is charged a hefty fee (roughly ten times the machine's hourly cost).
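To make that budget concrete, here is a back-of-the-envelope calculation as a minimal Python sketch. The 6 credits / hour figure comes from the t3a.nano numbers above; everything else is simple arithmetic.

```python
# Back-of-the-envelope CPU credit math for a t3a.nano (6 credits / hour).
# One CPU credit = one vCPU at 100% for one minute = 60 CPU-seconds.
CREDITS_PER_HOUR = 6
SECONDS_PER_CREDIT = 60

budget_per_hour = CREDITS_PER_HOUR * SECONDS_PER_CREDIT   # 360 CPU-seconds / hour
sustained_fraction = budget_per_hour / 3600               # fraction of one core, sustained

print(f"Hourly CPU budget: {budget_per_hour} CPU-seconds")
print(f"Sustained usage allowed: {sustained_fraction:.0%} of one core")
```

In other words, the instance can sustain only about 10% of a single core before it starts eating into whatever credits it has saved up.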
We have users that disable the ability to go over the budget.
Let's consider what happens to such an instance when it hits an unexpected load. In the context of RavenDB, it can be that you created a new index on a big collection. But something seemingly simple, such as generating a client certificate, can also have a devastating impact on such an instance. RavenDB generates client certificates with 4096-bit keys. On my machine (a nice, powerful dev machine), that can take anywhere from 300 – 900 ms of CPU time and cause a noticeable spike on one of the cores:
On the nano machine, we have measured key creation times of over six minutes.
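If you want to reproduce the kind of CPU spike described above, here is a small sketch using Python's cryptography package to time a 4096-bit RSA key generation. This is not RavenDB's own certificate code, just an easy way to see how expensive the operation is on whatever machine you run it on.

```python
import time
from cryptography.hazmat.primitives.asymmetric import rsa

# Generating a 4096-bit RSA key is purely CPU-bound work. On a throttled
# burstable instance, this single call can burn through the entire credit
# balance. The timing varies from run to run because prime search is random.
start = time.process_time()
key = rsa.generate_private_key(public_exponent=65537, key_size=4096)
elapsed = time.process_time() - start

print(f"4096-bit key generated in {elapsed:.2f}s of CPU time")
```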
The answer isn't that my machine is 800 times more powerful than the cloud machine. The problem is that this takes enough CPU time to consume all the available credits, which causes throttling. At this point, we are in a sad state. Any additional CPU credits (and time) that we earn go directly to the already running computation. That means that any other CPU work is pushed back. Way back. This happens at a level below the operating system (at the hypervisor), so the OS isn't even aware of it. From the point of view of the operating system, the CPU has simply become much slower.
All of this is expected given the burstable nature of the system. But what is the observed behavior?
The instance appears to be frozen. All the cloud metrics will tell you that everything is green, but you can't access the system (there is no CPU to accept new connections), you can't SSH into it (same reason), and if you have an existing SSH connection, it appears to have frozen. Measuring performance from inside the machine shows that everything is fine. One CPU is busy (expected: generating the certificate, doing indexing, etc.). Another is idle. But the system behaves as if it has no CPU available, which is exactly what is going on, except in a way that we can't observe.
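For illustration, this is roughly what "measuring from inside" amounts to, using the psutil package as a stand-in for whatever monitoring you run on the guest. The per-core numbers it reports describe what the operating system can see, not what the hypervisor is withholding.

```python
import psutil

# Per-core utilization as seen by the guest OS. On a throttled burstable
# instance this can look perfectly ordinary (one core busy, one idle),
# which tells you nothing about the hypervisor-level CPU credit throttling.
per_core = psutil.cpu_percent(interval=1, percpu=True)
for i, pct in enumerate(per_core):
    print(f"CPU {i}: {pct:.0f}% busy")
```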
RavenDB goes to a lot of trouble to be a nice citizen and a good neighbor. This includes scheduling operations so the underlying OS will always have the resources required to handle additional load; for example, background tasks are run with lowered priority. But when running on a burstable machine that is being throttled… well, from the outside it looked like certain trivial operations would just take the entire machine down, and it wouldn't be recoverable short of a hard reboot.
Even when you know what is going on, it doesn't really help you, because from inside the machine there is no way to access the cloud metrics with enough precision to take action.
We have a pretty strong desire not to get into these kinds of situations, so we implemented what I can only refer to as countermeasures. When you are running on a burstable instance, you can let RavenDB know what your CPU credits situation is, at which point RavenDB will actively monitor the machine and compute its own tally of the outstanding CPU credits. When we notice that the CPU credits are running short, we'll start proactively halting internal processes to give the machine room to recover from the CPU credits shortage.
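Conceptually, the tally is straightforward: sample CPU usage, add the credits the instance earns, subtract the credits it burns, and react when the balance gets low. The sketch below illustrates that idea only; the class name, thresholds, and reaction are made up for illustration and are not RavenDB's actual implementation.

```python
import psutil

class CpuCreditTally:
    """Illustrative only. Keeps a running estimate of CPU credits,
    assuming one credit is one vCPU-minute and the instance earns
    credits_per_hour (6 for a t3a.nano, per the numbers above)."""

    def __init__(self, credits_per_hour, initial_credits):
        self.credits_per_hour = credits_per_hour
        self.credits = initial_credits

    def sample(self, interval=5):
        # Credits earned while we waited...
        earned = self.credits_per_hour * interval / 3600
        # ...minus credits burned: busy vCPU-minutes over the interval.
        busy_vcpus = psutil.cpu_percent(interval=interval) / 100 * psutil.cpu_count()
        burned = busy_vcpus * interval / 60
        self.credits += earned - burned
        return self.credits

tally = CpuCreditTally(credits_per_hour=6, initial_credits=12)
while True:
    if tally.sample() < 2:   # arbitrary threshold, for illustration only
        # In RavenDB's case the reaction is to pause indexing, move ETL
        # and backups to other nodes, and eventually shed requests.
        print("CPU credits running low, start shedding work")
```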
We’ll move ETL processes and ongoing backups to other nodes in the cluster and even pause indexing in order to recover the CPU time we need. Here is an example of what you’ll see in such a case:
One of the nice things about the kind of promises RavenDB makes about its indexes is that it can make these kinds of decisions without impacting the guarantees that we provide.
If the situation is still bad, we'll start proactively rejecting requests. The idea is that instead of getting to the point where we are being throttled (and effectively down, as far as the world is concerned), we'll answer a request by simply sending a 503 Service Unavailable response back. This is very cheap to do, so it doesn't put additional strain on our limited budget. At the same time, the RavenDB client treats this kind of error as a trigger for failover and will use another node to service the request.
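On the client side, the effect is an ordinary failover. The sketch below shows the general shape of that logic in Python with the requests library; the function, node URLs, and path are hypothetical placeholders, not the actual RavenDB client API.

```python
import requests

def execute_with_failover(nodes, path):
    """Illustrative failover loop: a node that is shedding load answers
    with 503 Service Unavailable, so we simply try the next node.
    The nodes and path here are placeholders, not real RavenDB endpoints."""
    last_error = None
    for node in nodes:
        try:
            response = requests.get(f"{node}{path}", timeout=5)
        except requests.RequestException as err:
            last_error = err
            continue
        if response.status_code == 503:   # node is protecting its CPU budget
            continue
        return response
    raise RuntimeError(f"All nodes unavailable (last error: {last_error})")

# Usage with hypothetical cluster URLs:
# execute_with_failover(["https://node-a:8080", "https://node-b:8080"], "/databases/db/docs")
```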
This was a pretty long post to explain that RavenDB is going to give you a much nicer experience on burstable machines, even when you have no bursting capacity left.
Oren Eini is the CEO and Chief Developer of RavenDB, a NoSQL Distributed Database that's Fully Transactional (ACID) both across your database and throughout your database cluster.