Tips for reducing the cost of your infrastructure

In today’s complex world, ops engineers and SREs get to worry not just about the quality of infrastructure, but also about reducing its cost. Here, I’ll share a few tips on reducing the cost of servers with popular cloud providers, as well as a general approach for those to whom this type of work is new.

When you get hired as an Ops engineer or an SRE, you probably know what you are getting into and what you are supposed to do: things like maintaining servers, developing build and deployment pipelines, provisioning and initiating cloud servers, monitoring services and the like come to mind. One thing that does not usually come to mind when discussing the role of the operations engineer is finance-related work such as cost analysis and cost reduction.

This, as it turns out, can sometimes be quite a big part of our jobs in companies.

Who would have thought, for example, that I’d become quite familiar with a term like COGS, or “cost of goods sold”, defined by Investopedia as “the direct costs attributable to the production of the goods sold by a company”? In plain English: “how much does it cost to create a product before selling it”, or, if you prefer plain engineering English: “how much do we spend to have our services running in production”. Basically, this is about cost reduction: you start off by creating a budget in order to understand how much you are spending, how much you are going to spend, and whether your costs are going up or down. Let’s look, step by step, at how to do this with cloud services.

Steps to success with (cloud) cost reduction

Handling cost in today’s cloud is very similar to other analysis tasks such as monitoring. You have to:

  1. Collect billing data
  2. Analyze and visualize the data
  3. Alert on issues
  4. Act

Collect

Amazon Web Services (AWS) has a very nice billing dashboard. They basically collect the data for you, so you can just skip to the analyze part \o/. Google Cloud Platform (GCP) does not currently provide a very convenient way to see your billing data, but it does provide a way to export your billing data to BigQuery.

Analyze

There are all kinds of ways to analyze your data, but a simple way you can start with is looking at 2 types of analyses:

  1. Monthly Analysis

Month-by-month analysis over a period of a year (or even less than a year) can give you interesting insights, such as: is the cost of the infrastructure increasing at the same rate as the growth of the business?

  2. Daily Analysis

Track the changes you have made this week, and see if they have the effect on the monthly budget that you expected.
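To make the monthly comparison concrete, here is a minimal sketch in Python. It assumes your billing data has already been exported as (date, cost) rows; the sample figures, the `customers_per_month` series, and the function names are all hypothetical and only illustrate the idea of comparing cost growth with business growth.

```python
from collections import defaultdict
from datetime import date


def monthly_totals(rows):
    """Sum daily (date, cost) rows into {(year, month): total}."""
    totals = defaultdict(float)
    for day, cost in rows:
        totals[(day.year, day.month)] += cost
    return dict(sorted(totals.items()))


def growth_rates(series):
    """Month-over-month growth as fractions, e.g. 0.10 means +10%."""
    values = list(series)
    return [(cur - prev) / prev for prev, cur in zip(values, values[1:])]


# Hypothetical sample data: daily spend and customers per month.
daily_costs = [
    (date(2018, 1, 10), 400.0), (date(2018, 1, 20), 600.0),
    (date(2018, 2, 10), 500.0), (date(2018, 2, 20), 700.0),
    (date(2018, 3, 10), 700.0), (date(2018, 3, 20), 800.0),
]
customers_per_month = [100, 125, 150]

costs = monthly_totals(daily_costs)
cost_growth = growth_rates(costs.values())
business_growth = growth_rates(customers_per_month)

# Compare each month (from the second one on) with the previous month.
for (year, month), c, b in zip(list(costs)[1:], cost_growth, business_growth):
    flag = "cost outpacing growth" if c > b else "ok"
    print(f"{year}-{month:02d}: cost {c:+.0%}, business {b:+.0%} ({flag})")
```

The same `monthly_totals` idea works for the daily view: skip the aggregation and compare each day with the same weekday of the previous week to see whether a change you made had the effect you expected.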

AWS has a great tool for cost analysis: the Cost Explorer.

It is part of the billing dashboard and it has very strong analytical abilities: you can analyze cost by date (yearly/monthly/daily), and it has hundreds of other dimensions that can be configured, such as which machines and databases you are spending on, when things start going up or down, etc. A good practice here is to tag assets; for example, you can tag an instance by its principal user so you can quickly track back to the person over-utilizing or under-utilizing a machine. Another good practice is to tag by product, which makes it easy to know how much a project costs at a certain point in time.

In GCP, once your data is in BigQuery, you can use Google’s Data Studio or Redash (an open source tool) to do quite sophisticated analyses. Check out this elaborate blog post from Google.
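As an illustration of why tagging pays off, here is a small Python sketch that totals cost per tag value. The line items, tag keys (`owner`, `product`) and the `cost_by_tag` helper are hypothetical; in practice the rows would come from your billing export rather than a hard-coded list.

```python
from collections import defaultdict


def cost_by_tag(line_items, tag_key, untagged="(untagged)"):
    """Total cost per value of one tag; items missing the tag are grouped."""
    totals = defaultdict(float)
    for item in line_items:
        totals[item.get("tags", {}).get(tag_key, untagged)] += item["cost"]
    return dict(totals)


# Hypothetical line items, shaped like rows from a billing export.
line_items = [
    {"resource": "i-aaa", "cost": 120.0, "tags": {"owner": "dana", "product": "search"}},
    {"resource": "i-bbb", "cost": 80.0, "tags": {"owner": "lee", "product": "search"}},
    {"resource": "i-ccc", "cost": 50.0, "tags": {"owner": "dana", "product": "analytics"}},
    {"resource": "vol-ddd", "cost": 30.0, "tags": {}},
]

by_product = cost_by_tag(line_items, "product")
by_owner = cost_by_tag(line_items, "owner")
print(by_product)
print(by_owner)
```

The `(untagged)` bucket is worth watching on its own: if it grows, resources are being created without tags and you are losing visibility.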

Alert

Both AWS and GCP let you set up billing alerts.

In AWS, you can set up billing alerts (such as notifications of costs exceeding the free tier) in a granular way.
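If you also want a check of your own on top of the providers’ alerts, the logic can be quite simple. The sketch below is not the AWS or GCP alerting API; it is a hypothetical, homegrown check that projects month-end spend linearly from spend to date and compares it with a budget. The function names, the 80% warning ratio and the numbers are assumptions for illustration.

```python
import calendar
from datetime import date


def projected_month_end(spend_to_date, today):
    """Linear projection: average daily spend so far times days in the month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month


def check_budget(spend_to_date, today, monthly_budget, warn_ratio=0.8):
    """Return 'ok', 'warn' or 'alert' for the projected month-end spend."""
    projected = projected_month_end(spend_to_date, today)
    if projected > monthly_budget:
        return "alert"
    if projected > warn_ratio * monthly_budget:
        return "warn"
    return "ok"


# Hypothetical numbers: $4,000 spent by June 10 against a $10,000 budget.
status = check_budget(4000.0, date(2018, 6, 10), 10000.0)
print(status)
```

A linear projection is crude (it ignores weekly patterns and one-off charges), but it catches the common case of a forgotten resource quietly pushing the bill up mid-month.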

Act

  • Cleaning. The first and easiest thing to do is to start cleaning up, by which I mean removing machines that are not used. We all start test machines that we plan to delete at the end of the day, but they often stay behind and incur cost for…a while. I suggest a continuous cleaning plan for machines, instances, IP addresses, files in S3 (here, you can use lifecycle management, which automatically moves files from hot storage to cold storage and then deletes them per the schedule you set), and EBS snapshots (if you back up your data). Also, and this is very important if you are using AWS: do not forget to check all regions.
  • Data transfer. In cloud services, data transfer is quite complicated. Here is an example:

[Figure: AWS data transfer costs]

You can see that cost of traffic between different regions and AWS services differs greatly, so it can be very important to understand where you are transferring data to and from.

One tip: communication between regions is relatively expensive, but can be cheaper between regions that are close, or if one of them is new. If you want to be multiregional, one way to save is by finding cheaper connections between regions.

Another tip: you can reduce the pricing of load balancers by simply asking for a discount; CloudFront and CDN pricing are quite negotiable at high volumes.

  • Compute is simpler, and there are quite a few ways to save a good deal of money. Here are a few:

– AWS: Switch to a newer generation of instances. For example, the M3 instance had an inferior CPU and less memory, and cost more than the newer M4. The recently released C5 instance family has a faster CPU than C3 and C4 and once again costs less.

– AWS: Reserved instances allow you to get significant discounts on EC2 compute hours in return for a commitment to pay for instance hours of a specific instance type in a specific AWS region and availability zone for a pre-established time frame (1 or 3 years). Further discounts can be realized through “partial” or “all upfront” payment options.

– AWS: EC2 Spot instances are a way to get EC2 resources at a significant fluctuating discount — often many times cheaper than standard on-demand prices — if you’re willing to accept the possibility that they be terminated with little to no warning if you underbid. I highly recommend a company by the name of Spotinst that can manage the hard work and uncertainty for you.

– GCP: if your workload is stable and predictable, you can purchase a specific amount of vCPUs and memory for up to a 57% discount off of normal prices in return for committing to a usage term of 1 year or 3 years.

– GCP: A preemptible VM is an instance that you can create and run at a much lower fixed price than normal instances on GCP. However, Compute Engine might terminate (preempt) these instances if it requires access to those resources for other tasks. They also last a maximum of 24 hours.

  • Tagging for cost visibility. As mentioned above, as the infrastructure grows, a key part of managing costs is understanding where they lie. It’s strongly advisable to tag resources, and as complexity grows, group them effectively.
  • Human “resources”. Don’t be shy about asking your account manager for guidance in reducing your bill. It’s their job to keep you happily using their service.
  • Serverless. I am not going to advise moving all your servers to a serverless architecture. It’s not great for everything, but let’s say you have a cron machine, and it’s running something every hour – perhaps moving it to serverless makes more sense. You pay per function run, and save additional money by saving on operation costs: looking at logs, maintaining the server, etc.
  • Using multi-cloud. This is complicated and not for the faint of heart. You need to think deeply about your architecture before you start, and the real downside is that sometimes you can’t do all the cool stuff that each cloud provider gives you (e.g., CloudFormation, which works on Amazon but not on Google). There are some tools that can help you be multi-cloud, like Ansible or Terraform, or Dockerizing everything to run everywhere. On the upside, you can get credits from Google, then credits from Amazon (then you ask for a few more); you have account managers in both companies and can negotiate with the leverage of actually using the competing service (this of course does require having some volume in production).
  • Go bare metal. Many past disadvantages of bare metal are going away; bare metal companies like LeaseWeb and OVH now have APIs — you can create a server or replace a broken hard drive on your machine without talking to anyone. The prices are significantly lower.
    We heavily use bare-metal servers and you can read more about it here.
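Going back to the cleaning tip at the top of this list, a continuous cleaning plan can start as a simple report of resources nobody has touched for a while. The sketch below is a hypothetical illustration: the inventory records, the `keep` flag, the 7-day threshold and the costs are assumptions, and in practice you would fill the inventory from each provider’s API, for every region.

```python
from datetime import datetime, timedelta


def stale_resources(inventory, now, max_idle_days=7):
    """Return resources idle longer than max_idle_days, unless marked 'keep'."""
    cutoff = now - timedelta(days=max_idle_days)
    return [
        r for r in inventory
        if r["last_used"] < cutoff and not r.get("keep", False)
    ]


# Hypothetical inventory collected from several regions.
inventory = [
    {"id": "i-test-1", "region": "us-east-1",
     "last_used": datetime(2018, 5, 1), "monthly_cost": 70.0},
    {"id": "i-api-1", "region": "us-east-1",
     "last_used": datetime(2018, 6, 14), "monthly_cost": 300.0},
    {"id": "eip-1", "region": "eu-west-1",
     "last_used": datetime(2018, 4, 2), "monthly_cost": 3.6},
    {"id": "i-cron-1", "region": "ap-south-1",
     "last_used": datetime(2018, 3, 1), "monthly_cost": 35.0, "keep": True},
]

stale = stale_resources(inventory, now=datetime(2018, 6, 15))
for r in stale:
    print(f"{r['region']}: {r['id']} idle since {r['last_used']:%Y-%m-%d}")

saving = sum(r["monthly_cost"] for r in stale)
print(f"potential monthly saving: ${saving:.2f}")
```

Starting with a report rather than automatic deletion is a deliberate choice: it lets owners object before anything is removed.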

Bonus tips

  • Ec2instances.info: This is a great tool that helps you see all the instance types in AWS, including prices for different regions; you can compare selected instances. It helps you see things clearly in one place. Note: while it updates regularly, I would always double check the information.
  • Keep learning: the Open Guides for AWS is updated daily and is a fantastic place to learn. Its billing and cost section covers many of the things I talked about here, and more.

Compute price comparison

Let’s take the example of a pretty strong machine, the AWS EC2 m4.10xlarge (40 vCPU, 160 GiB RAM, 10 Gigabit network), and compare its price with other possibilities and different optimizations (the table is sorted by provider).

| Provider | Machine Type | Price / Month | Comment |
|---|---|---|---|
| AWS – OnDemand | m4.10xlarge | $1,460.00 | |
| AWS – Reserved 1 Year | m4.10xlarge | $904.47 | |
| AWS – Reserved 3 Years | m4.10xlarge | $630.72 | |
| AWS – Spot instance | | ~$447.696 | Prices change very frequently |
| GCP – Sustained Price | Custom Instance | $1,054.00 | |
| GCP – Upfront | Custom Instance | $873.77 | |
| GCP – Preemptible | Custom Instance | $314.40 | |
| LeaseWeb | R730XD | $374.99 | 2x 10 cores, 256GB DDR3 RAM, 2x480GB SSD, 10 TB traffic |
| OVH | MG-256 | $365.99 | 20 cores, 256GB, Disks 2x2TB |

The first thing you can easily spot is that bare metal providers such as LeaseWeb and OVH provide the best bang for the buck, and they also include storage and traffic in the same package. You can also see that those bare metal providers are much less flexible in terms of machine types.

Another thing we need to consider is that in a cloud environment we can pay for only what we use, so if we need a machine for an hour a day, we actually don’t have to pay a monthly fee, and this might reduce costs dramatically, especially if we use spot instances or preemptible instances.
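To put rough numbers on this, here is a small Python sketch using the monthly prices from the comparison table above. It assumes 730 billable hours per month (8,760 hours per year divided by 12) to derive an hourly on-demand rate; that figure, the 30-day month and the function names are assumptions of the sketch, and real bills will differ (the bare metal machine is also not identical hardware).

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a month (8,760 / 12)

# Monthly prices taken from the comparison table.
ON_DEMAND = 1460.00
RESERVED_1Y = 904.47
OVH_BARE_METAL = 365.99

# Hourly on-demand rate implied by the monthly price.
HOURLY = ON_DEMAND / HOURS_PER_MONTH


def on_demand_cost(hours_per_day, days=30):
    """Monthly on-demand cost when the machine only runs part of each day."""
    return HOURLY * hours_per_day * days


def break_even_hours_per_day(fixed_monthly_price, days=30):
    """Daily usage above which a fixed monthly price beats on-demand."""
    return fixed_monthly_price / (HOURLY * days)


print(f"1 hour/day on demand: ${on_demand_cost(1):.2f}/month")
print(f"break-even vs reserved 1y: {break_even_hours_per_day(RESERVED_1Y):.1f} h/day")
print(f"break-even vs OVH: {break_even_hours_per_day(OVH_BARE_METAL):.1f} h/day")
```

Under these assumptions, a machine that is only needed for an hour a day costs a small fraction of any monthly commitment, while a machine that runs most of the day is better served by a reservation or by bare metal.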

Here at Algolia we actually chose a mix of providers. Using bare metal for the Algolia engine and API was the best decision for us, but we also use Google Cloud Platform for our log processing and analytics, and AWS for many different production and internal services.

The bottom line is that, as always with building and maintaining a robust infrastructure, you need to choose what’s best for your company and your use case. Hopefully, the tips above will help you make the right choices at the right price. Have other tips? We’d love to hear them: @eranchetz, @algolia.