Breakeven Points Everywhere

March 19, 2024

Serverless is a boon for many businesses: it removes much of the operational and development cost of deploying workloads on the cloud. As a matter of fact, Opti Owl itself runs on AWS Fargate, a serverless compute engine that runs containers directly so that you don't have to manage the underlying EC2 instances that host ECS/EKS containers.

“Serverless is prohibitively expensive ‘at scale’.”
“Serverless is cheaper for spiky and inconsistent workloads.”

These are some of the most popular idioms for judging serverless resource costs. But how do you know whether your workload has reached “scale”, or whether it is inconsistent enough for serverless to be cheaper than its counterpart?

To build a proper mental model, we have to move beyond such generalisations and use a framework that tells us which statement applies in which scenario. Breakeven point analysis provides exactly this: by finding the breakeven point along different dimensions, it becomes much easier to properly compare serverless with non-serverless solutions.

Breakeven Point against Development & Maintenance Costs

Some AWS services offer a trade-off in the form of reduced development & maintenance costs. For example, AWS Fargate removes the maintenance of the underlying EC2 fleet, and AWS Lambda goes even further by handling capacity management, request-level monitoring, etc.

To better understand the breakeven point for the “Development & Maintenance Costs” dimension, Lambda and EC2 make a very good pair of products to compare.

When comparing the costs of Lambda and EC2, one needs to consider not just the resource costs but also the development and maintenance costs. Together these make up the TCO (Total Cost of Ownership).

TCO = Development & Maintenance Costs + Infrastructure Cost

Calculating TCO is complex, as the “development & maintenance” costs vary between organisations and sometimes even within the same organisation, from team to team. They depend on the team's skill level with that specific set of technologies and its proficiency in managing the stack. In some cases they are not even team related, but are driven by the organisation’s security posture and operational priorities.

For instance, businesses that operate in highly regulated markets might need to keep security patches up to date, while most businesses that sit behind a top-level firewall may not need to patch their EC2 instances for every CVE.

The graph below encapsulates the mental model of where Lambda and EC2 land in terms of Infrastructure Cost and Development & Maintenance Costs.

If we were to plot the total cost of EC2 and Lambda for an application with the same throughput, it would look something like the graph below.

As you can see, EC2 has much higher development & maintenance costs than Lambda. If you superimpose these graphs as below, you will start to see a breakeven point: beyond it Lambda becomes costlier in all cases, while below it Lambda is cheaper in most cases.

It's really hard to figure out the exact breakeven point in terms of total cost or total throughput, because in one organisation the team could build solutions on EC2 with very little effort, while in another it could be completely different. In some cases it may not even depend on the team but on the maintenance requirements: some organisations want up-to-date security patches, in-depth logging and metrics, while others may not need them at all. The EC2 maintenance costs are therefore highly variable.

So for some organisations the breakeven point could sit at a very low throughput, while for others it could sit at a much higher throughput or a much higher total cost. Often an easy way to build a mental model for the comparison is to calculate just the infrastructure cost without maintenance. The difference in infrastructure cost then makes it clear whether the extra maintenance effort is worth it for your workload.
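As a rough illustration of the superimposed cost lines described above, the sketch below treats each option as a fixed monthly development & maintenance cost plus an infrastructure cost that grows linearly with throughput. The linear model and every figure in it are hypothetical assumptions chosen only to show the mechanics; they are not measured Lambda or EC2 prices.

```python
from typing import Optional


def total_cost(fixed_monthly: float, cost_per_million: float, millions: float) -> float:
    """TCO = development & maintenance cost + infrastructure cost."""
    return fixed_monthly + cost_per_million * millions


def breakeven_requests(fixed_a: float, per_million_a: float,
                       fixed_b: float, per_million_b: float) -> Optional[float]:
    """Monthly volume (millions of requests) at which options A and B cost the same.

    Returns None when both lines have the same slope and never cross.
    A negative result means they do not cross at any positive volume.
    """
    if per_million_a == per_million_b:
        return None
    return (fixed_b - fixed_a) / (per_million_a - per_million_b)


# Hypothetical figures, for illustration only.
LAMBDA_FIXED, LAMBDA_PER_MILLION = 500.0, 20.0   # low upkeep, pricier per request
EC2_FIXED, EC2_PER_MILLION = 4000.0, 6.0         # high upkeep, cheaper per request

breakeven = breakeven_requests(LAMBDA_FIXED, LAMBDA_PER_MILLION,
                               EC2_FIXED, EC2_PER_MILLION)
print(f"Breakeven at about {breakeven:.0f} million requests per month")
```

Raising the hypothetical EC2 fixed cost moves the breakeven point to a higher throughput, which is why the same workload can land on different sides of the line in different organisations.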

Breakeven Point against Efficient Utilisation of Resources

While maintenance costs are variable and depend heavily on a number of factors, calculating the breakeven point for 'Efficient Utilisation of Resources' is a much more straightforward, mathematical exercise.

Why does utilisation matter? Let's take the above example of Lambda vs EC2. Even though EC2 is cheaper than Lambda on paper, it also matters how much of the EC2 instance is actually used. For example, an EC2 instance at 50% utilisation would most likely be cheaper than Lambda, but if its CPU & memory utilisation is only 5%, it would most likely be costlier than Lambda. This is the key insight into how utilisation affects cost.
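One way to make this concrete is to divide the instance price by its average utilisation, which gives the price of the capacity that is actually used. The sketch below does that with two hypothetical hourly prices (they are assumptions for illustration, not real EC2 or Lambda rates), where the Lambda figure stands for the cost of serving the same work as one fully busy instance.

```python
def cost_per_used_hour(hourly_price: float, utilisation: float) -> float:
    """Effective price of one fully used instance-hour at an average utilisation in (0, 1]."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in the range (0, 1]")
    return hourly_price / utilisation


# Hypothetical prices, for illustration only.
EC2_HOURLY = 0.10                 # provisioned instance, paid for whether busy or idle
LAMBDA_EQUIVALENT_HOURLY = 0.40   # pay-per-use cost of the same amount of work


def cheaper_option(utilisation: float) -> str:
    """Name the cheaper option at the given average EC2 utilisation."""
    if cost_per_used_hour(EC2_HOURLY, utilisation) < LAMBDA_EQUIVALENT_HOURLY:
        return "EC2"
    return "Lambda"


# Utilisation at which both options cost the same.
breakeven_utilisation = EC2_HOURLY / LAMBDA_EQUIVALENT_HOURLY

for u in (0.50, 0.05):
    print(f"{u:.0%} utilised: {cheaper_option(u)} is cheaper")
```

With these assumed prices the breakeven sits at 25% utilisation, so the 50% case favours EC2 and the 5% case favours Lambda.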

One might argue that 5% utilisation is an operational issue and that the instance should be right-sized. However, there are scenarios with valid and genuine reasons for that kind of utilisation:

  1. The traffic pattern could be spiky, so that the average utilisation over an entire day is just 5% while the max utilisation is 50%.

  2. Autoscaling inefficiency: the EC2 instances are scaled on SQS queue size or thread count rather than CPU/memory utilisation, or the workload has spikes but autoscaling scales down at a much slower pace. All of these can result in a very low average utilisation over an entire day.

  3. The organisation might be in a growth phase with no time to focus on proper maintenance, or the initial maintenance estimates were miscalculated, etc. In such cases one could also argue that if operational efficiency tasks have consistently gone into the backlog over the last 12 months and there is no bandwidth to reprioritise them in the foreseeable future, it is cheaper to move to Lambda than to fight against the way the organisation works.

To further understand this type of breakeven point, let’s compare ElastiCache On-Demand with ElastiCache Serverless.

ElastiCache On-Demand vs ElastiCache Serverless

Imagine you are building an application that requires a cache that provides fast data access to enable a responsive, real-time user experience for an e-commerce website. You estimate that the application caches 10 GB of data most of the time, and grows to 50 GB during peaks for two hours during the day.

This example scenario comes from the AWS blog post on ElastiCache Serverless. The scenario aims to keep 20% headroom, i.e. it picks an instance configuration that provides around 62.5 GB (59.22 GB in the blog, which is pretty close): 3 shards, 6 nodes of cache.r7g.xlarge.

One thing to note here is that this 20% headroom is on top of the 25% reserved for backup use, which means that as an operator you are aiming to keep the max memory utilisation at 60% (i.e. 100 × (1 − 0.25) × (1 − 0.2) = 60%).
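The headroom arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation described in the text; the function and variable names are made up for illustration.

```python
def max_target_utilisation(reserved_fraction: float, headroom_fraction: float) -> float:
    """Share of total node memory an operator aims to fill at peak."""
    return (1 - reserved_fraction) * (1 - headroom_fraction)


RESERVED_FOR_BACKUP = 0.25   # memory reserved for backup use
HEADROOM = 0.20              # spare capacity kept on top of the peak
PEAK_DATA_GB = 50            # peak cache size from the scenario

target = max_target_utilisation(RESERVED_FOR_BACKUP, HEADROOM)
required_usable_gb = PEAK_DATA_GB / (1 - HEADROOM)

print(f"Target max utilisation: {target:.0%}")
print(f"Usable memory needed for the peak: {required_usable_gb:.1f} GB")
```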

As shown in the AWS blog, ElastiCache Serverless is cheaper in this scenario. But if you inspect it closely, the cluster holds 50 GB for only 2 hours of the day. The max memory utilisation would be (50 / 59.22) × 100 = 84%, but the average memory utilisation across an entire day would be (((50 × 2 + 10 × 22) / 24) / 59.22) × 100 = 22.51%. On top of that, the DatabaseMemoryUsagePercentage metric doesn't count the memory reserved for backups (25%), which means the actual usage percentage here would be 16.88%.

Breakdown

Memory Usage For 2 hours = 50GB

Memory Usage For 22 hours = 10GB

Average Memory Usage = (50 × 2 + 10 × 22) / 24 = 13.33 GB

Total Memory Provisioned With 25% Reserved for Backup = 59.22 GB

Total Memory = 59.22 / (1 - 0.25) = 78.96 GB

Average Utilisation = Average Usage / Total Memory = 13.33 / 78.96 = 16.88%
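The breakdown can be reproduced with a short time-weighted average. The sketch below uses only the numbers from the scenario; the helper name and the list layout are illustrative choices, not anything prescribed by AWS.

```python
from typing import List, Tuple


def time_weighted_average(segments: List[Tuple[float, float]]) -> float:
    """Average of (value, hours) segments, weighted by how long each value lasts."""
    total_hours = sum(hours for _, hours in segments)
    return sum(value * hours for value, hours in segments) / total_hours


# (memory used in GB, hours per day) from the scenario
daily_usage = [(50, 2), (10, 22)]

USABLE_MEMORY_GB = 59.22   # provisioned memory after the 25% backup reservation
RESERVED_FRACTION = 0.25

average_gb = time_weighted_average(daily_usage)
total_memory_gb = USABLE_MEMORY_GB / (1 - RESERVED_FRACTION)

peak_util = 50 / USABLE_MEMORY_GB
avg_util_usable = average_gb / USABLE_MEMORY_GB
avg_util_total = average_gb / total_memory_gb

print(f"Average usage: {average_gb:.2f} GB of {total_memory_gb:.2f} GB")
print(f"Peak utilisation of usable memory: {peak_util:.1%}")
print(f"Average utilisation of usable memory: {avg_util_usable:.1%}")
print(f"Average utilisation of total memory: {avg_util_total:.1%}")
```

The last figure comes out within rounding of the 16.88% above; the small difference depends on whether the 13.33 GB average is rounded before dividing.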

From this example it's pretty clear that if your Redis workload has an average memory usage of 16.88% or below, it is cheaper, at least on the memory dimension, to use ElastiCache Serverless than ElastiCache On-Demand.

If we go one step further and plot the utilisation percentage against the price of both an ElastiCache cluster and ElastiCache Serverless, we start to see something like this:

From the above graph, we can see that the breakeven point for the memory dimension is at around 26.56%. This means that, for the memory dimension, if your daily average cluster utilisation is below 26.56%, it is cheaper to use ElastiCache Serverless.
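The breakeven figure can be approximated by finding the average data size at which the serverless storage charge equals the hourly cost of the cluster. The two prices in the sketch below are assumptions that the article does not state (list prices believed to apply to us-east-1 at the time of writing), and ECPU charges are ignored, so treat it as a reconstruction rather than the exact calculation behind the graph.

```python
# Assumed prices, not stated in the article; check the current AWS pricing page.
NODE_PRICE_PER_HOUR = 0.437            # cache.r7g.xlarge on-demand, USD
SERVERLESS_PRICE_PER_GB_HOUR = 0.125   # ElastiCache Serverless data stored, USD

# Cluster shape and memory from the scenario.
NODE_COUNT = 6
TOTAL_MEMORY_GB = 78.96   # includes the 25% reserved for backups

cluster_cost_per_hour = NODE_COUNT * NODE_PRICE_PER_HOUR

# Average data size at which serverless storage costs as much as the cluster.
breakeven_gb = cluster_cost_per_hour / SERVERLESS_PRICE_PER_GB_HOUR
breakeven_utilisation = breakeven_gb / TOTAL_MEMORY_GB

print(f"Cluster cost: ${cluster_cost_per_hour:.3f} per hour")
print(f"Breakeven average usage: {breakeven_gb:.2f} GB")
print(f"Breakeven utilisation: {breakeven_utilisation:.2%}")
```

Under these assumptions the result lands within rounding of the 26.56% read from the graph. A different region, node type or price change would move it.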

One important thing to note is that the above analysis doesn’t consider the CPU dimension. Still, performing breakeven point analysis on a single dimension gives a very good mental model of how spiky traffic needs to be before serverless becomes cheaper. That way we can apply serverless only to the spiky workloads where it is actually cheaper, rather than blindly following the idiom for all spiky workloads.

A similar breakeven point can be observed for Fargate vs ECS/EKS on EC2; for a detailed analysis refer to “Why Fargate is Still too Costly?”. In the case of ECS/EKS on EC2, if you consistently see that the Cluster Reservation (the operational efficiency metric for ECS/EKS on EC2) is below 85%, then it’s cheaper to move to Fargate than to keep using EC2.

Breakeven Point against Feature Dimensions

Cost comparisons are not limited to serverless vs non-serverless. Often there are multiple AWS products that provide the same functionality with different trade-offs.

As with the trade-offs in serverless vs non-serverless, different feature dimensions also determine the breakeven points at which one resource becomes cheaper than the other.

To understand this type of breakeven point, let’s compare the cost of Aurora I/O-Optimized mode with Aurora Standard mode.

Aurora Standard mode charges you for the instances and also for the I/O requests consumed. Aurora I/O-Optimized mode charges only for the instances and includes I/O at no extra cost, regardless of the total volume. The catch is that Aurora I/O-Optimized instances are approximately 30% costlier than Aurora Standard instances.

For example, let’s take an r7g.4xlarge instance in both Standard and I/O-Optimized mode. The Standard mode price is $2.211 per hour and the I/O-Optimized mode price is $2.874 per hour, and Standard mode additionally charges $0.20 per million I/O requests.

As you can see from the above graph, the cost of Aurora I/O-Optimized mode remains static regardless of I/O volume, but the Aurora Standard mode cost increases as you consume I/O. The breakeven point here happens to be roughly 2.4 billion (2,400 million) I/O requests per month for 1 r7g.4xlarge node, and approximately 4.8 billion for 2 nodes. Since the whole cluster can only be in either Standard mode or I/O-Optimized mode, the breakeven point needs to be considered carefully, as it changes with the number of nodes.
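The breakeven volume follows directly from the prices quoted above: it is the number of I/O requests whose Standard mode charge equals the extra instance cost of I/O-Optimized mode. The sketch below assumes a 730-hour month, which is a common billing approximation and not a figure from the article, and it ignores storage charges.

```python
STANDARD_PRICE_PER_HOUR = 2.211       # r7g.4xlarge, Aurora Standard, USD
IO_OPTIMIZED_PRICE_PER_HOUR = 2.874   # r7g.4xlarge, Aurora I/O-Optimized, USD
PRICE_PER_MILLION_IO = 0.20           # Standard mode I/O charge, USD
HOURS_PER_MONTH = 730                 # assumption: average month length


def breakeven_io_millions(node_count: int) -> float:
    """Monthly I/O requests (in millions) at which both modes cost the same."""
    extra_instance_cost = (
        node_count
        * (IO_OPTIMIZED_PRICE_PER_HOUR - STANDARD_PRICE_PER_HOUR)
        * HOURS_PER_MONTH
    )
    return extra_instance_cost / PRICE_PER_MILLION_IO


for nodes in (1, 2):
    billions = breakeven_io_millions(nodes) / 1000
    print(f"{nodes} node(s): about {billions:.1f} billion I/O requests per month")
```

Below that monthly volume Standard mode is cheaper for the cluster; above it I/O-Optimized mode is.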

Do note that the above analysis doesn’t consider the extra storage charges in Aurora I/O-Optimized mode. Depending on your cluster storage, the breakeven point moves to the right as your storage increases.

To Summarise, What is Breakeven Point Analysis?

When multiple products offer similar functionality with different trade-offs and different cost models, one product beats the other at each end of the trade-off.

Breakeven point analysis makes this explicit by determining the point at which both products cost the same. This gives a better understanding of how to build a cost model when comparing two products, rather than just working from idioms.

Key Takeaways

  • Beyond Idioms: While idioms are easy to understand and make it easy to build mental models, they are also vague enough to lead to bad decisions. Over time, start relying on techniques like breakeven point analysis to better understand the trade-offs between products.

  • Calculating TCO: Calculating TCO can be quite overwhelming because of variable costs like development and maintenance costs. It's easier to calculate the non-variable costs and assess whether the difference is worth the trade-offs in variable costs.

  • Utilisation Matters: Non-serverless products, while cheap on paper, require careful attention to utilisation. Otherwise they might end up being more costly than serverless.

  • Resisting the Temptation to Migrate or Use All AWS Services: After performing the analysis, it might be tempting to migrate your existing workloads. However, it’s important to strategically assess these decisions considering both immediate benefits and long term implications. While some workloads are best suited for EC2 and some for Lambda, for businesses of decent size, diversifying across too many AWS services can introduce too much complexity and cognitive overhead as well.

Conclusion

All products have trade-offs. The breakeven point analysis framework provides a much more concrete way of comparing products than arbitrary examples that are too vague to show why one product is better than the other, and in which scenarios.

Stay up to date on managing cloud costs!

Opti Owl

Cut your cloud costs today
