For SLAs, there’s no such thing as 100% Uptime – only 100% Transparency

With the advent of cloud computing and its ubiquity over the past 10 years, the SaaS model has, brick-by-brick, revolutionized the way businesses all over the world operate. From processing payments to processing paychecks in HR, from measuring marketing ROI to boosting sales efficiency, the modern enterprise’s go-to reflex when it wants to improve how it operates is to turn to SaaS.

SaaS is undeniably an improvement over the former software licensing model, which came with bulky infrastructure, on-site maintenance, and less global scalability. SaaS is a global light switch that can instantly turn your business on and deliver it to literally billions of customers. But light switches go both ways, and, as any enterprise that has ever suffered an outage knows, SaaS downtime can cost big-time.

Just recently, Amazon’s own search went down for a few hours; even the most conservative estimates suggest that the outage cost them tens of millions in revenue. Outages like these are inevitable; however, the strongest SaaS providers today invest heavily both in strong uptime and in a strong SLA, so that customers know that, in the event that an outage impacts their business, they have insurance.

Today, I’d like to talk a little bit about our brand new SLA policy, and how it came to be.

What’s an SLA?

Before getting started, just to make sure we’re all on the same page, let’s get our vocabulary straight.

A Service Level Agreement (SLA) is essentially an insurance policy in case the service you paid for doesn’t operate as advertised. In general, the more critical a service is to a company’s operations, the stronger the SLA will need to be to satisfy customer worries. Likewise, a stronger SLA is a stronger commitment from a service provider to ensure flawless service.

As a SaaS player and a customer of several external services for both our search platform and our operations, we’ve seen (more than) our fair share of SLAs. As we evolved and improved our own SLA over the past two years, providing an increasingly strong & transparent promise to our customers, we began going through other providers’ SLAs with a fine-toothed comb. It turns out that not all SLAs are created equal.

Busting the myth of the 100% SLA

In recent years it’s become fashionable for companies to include 100% uptime guarantees in their SLA – and, in some cases, even more than 100%, despite its mathematical impossibility.

Now, don’t get us wrong – all service providers have an obligation to put 100% of their effort into keeping their service running like a well-oiled machine; however, the detection of an outage itself can sometimes even be impossible… until it’s too late, of course.

Being a SaaS provider on the internet implies dozens of dependencies on intermediary devices and networks, which themselves have downtime. When you promise 100% uptime, every millisecond of downtime counts – however, what if you can’t detect the outage? How can one tell if the issue comes from your connection dropping data, the service provider, or any of the dozens of intermediaries in between?

To resolve this issue, SLAs define a minimum outage duration below which they are not triggered. The market standard is typically 1 minute; however, 1 minute of downtime per month already means 99.9977% uptime (over a 30-day month), so what exactly is 100% uptime then?
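To make that arithmetic concrete, here is a minimal Python sketch that converts a monthly downtime into an uptime percentage. It assumes a 30-day month (43,200 minutes), and the function name is purely illustrative.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month, 43,200 minutes


def uptime_percent(downtime_minutes: float) -> float:
    """Return the uptime percentage for a given monthly downtime."""
    return 100.0 * (1.0 - downtime_minutes / MINUTES_PER_MONTH)


print(f"{uptime_percent(1):.4f}%")  # 1 minute of downtime -> 99.9977%
```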

Let’s take a look at an example of a “100% SLA” from a SaaS provider on the market today:

“credit of 5% of the fees paid for the month in which we fail to provide the stated level of service for every 0.05% of such a month, during which you are unable to transmit information up to an aggregate of 50% of the monthly service billing”

This SLA only kicks in after 0.05% downtime in a month, which is a little over 20 minutes of downtime – and yet, that same SLA claims 100% uptime guaranteed.
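One possible reading of the quoted clause, sketched in Python: each full 0.05% slice of the month spent down earns a 5% credit, capped at 50% of the monthly bill. The 30-day month, the rounding down to whole slices, and the function name are assumptions made for illustration, not terms taken from that provider’s contract.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60      # assumption: a 30-day month
SLICE_SECONDS = SECONDS_PER_MONTH // 2000  # 0.05% of the month = 1,296 s (21.6 min)


def quoted_sla_credit_percent(downtime_seconds: int) -> int:
    """Credit as a percentage of monthly fees under the assumed reading."""
    full_slices = downtime_seconds // SLICE_SECONDS  # partial slices earn nothing
    return min(50, 5 * full_slices)


print(quoted_sla_credit_percent(20 * 60))  # 20 minutes down -> 0
print(quoted_sla_credit_percent(22 * 60))  # 22 minutes down -> 5
```

Under this reading, a 20-minute outage earns no credit at all, even though the headline promise is 100% uptime.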

How we define our Enterprise SLAs

When we set out to design our SLA, we had three goals:

  1. Make it Simple – you only ever look at an SLA before purchasing and when there is downtime, and, in both instances, simplicity is key.

  2. Make it Transparent – no one wants unexpected surprises, especially if there is downtime. The easiest way to lose a customer is to let them down.

  3. Trust our platform – we trust the system we built, and we want an SLA that speaks to that trust.

At Algolia, we currently have two different setups for our customers:

  • Enterprise: we replicate your search on at least three different machines hosted by two different providers in different datacenters and autonomous systems.
  • Premium: we replicate your search on at least three different machines hosted by three different providers in three different datacenters with three autonomous systems using at least two different Tier1 upstream providers.

(Did we mention how inevitable dependency on intermediary services is?)

These setups differ not just on paper but also in terms of infrastructure, and they come with two different SLAs:

  • Enterprise: 99.99% uptime; each minute of downtime makes you eligible for 100 minutes of refund, up to a cumulative value of 100% of the monthly service billing.
  • Premium: 99.999% uptime; each minute of downtime makes you eligible for 1,000 minutes of refund, up to a cumulative value of 600% of the monthly service billing over a year.
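To put the two uptime figures in perspective, the following Python sketch converts an uptime percentage into the monthly downtime it leaves room for. It assumes a 30-day month, uses a hypothetical function name, and relies on Decimal to avoid floating-point noise in such small fractions.

```python
from decimal import Decimal

SECONDS_PER_MONTH = Decimal(30 * 24 * 60 * 60)  # assumption: a 30-day month


def downtime_budget_seconds(uptime_percent: str) -> Decimal:
    """Seconds of downtime per month allowed by an uptime percentage."""
    downtime_fraction = (Decimal(100) - Decimal(uptime_percent)) / Decimal(100)
    return downtime_fraction * SECONDS_PER_MONTH


print(downtime_budget_seconds("99.99"))   # Enterprise uptime figure
print(downtime_budget_seconds("99.999"))  # Premium uptime figure
```

Under these assumptions, 99.99% leaves about 259.2 seconds (4.32 minutes) of downtime per month, and 99.999% leaves about 25.92 seconds.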

 

Our outage detection starts at 30 seconds (roughly 0.001% of a month) instead of 1 minute. This is too granular to be measured with a traditional monitoring architecture, so we built our own monitoring network that continuously monitors our API infrastructure. That gives us a fairly unique ability to detect downtime this fast.

Here’s what our refund policy looks like in practice:

 

Total refund of the monthly service bill:

Search downtime    Enterprise SLA    Premium SLA
30 seconds         0%                1%
1 minute           0%                2.3%
5 minutes          1%                11.6%
30 minutes         7%                70%
45 minutes         10%               100%
1 hour             13.8%             138%
2 hours            27.7%             277%
4 hours            55.5%             555%
8 hours            100%              600%
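The figures in the table above can be approximated with a short Python sketch of the refund rule: refund minutes are downtime minutes times the plan multiplier, expressed as a share of the month and capped. The 30-day month, the PLANS dictionary, and the function name are assumptions for illustration. The sketch treats a single outage in isolation and ignores that the Premium cap is cumulative over a year.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month

# Hypothetical lookup: (refund minutes per minute of downtime, cap as % of monthly bill)
PLANS = {
    "enterprise": (100, 100.0),
    "premium": (1000, 600.0),
}


def refund_percent(downtime_minutes: float, plan: str) -> float:
    """Refund as a percentage of the monthly bill for one outage."""
    multiplier, cap = PLANS[plan]
    uncapped = 100.0 * downtime_minutes * multiplier / MINUTES_PER_MONTH
    return min(cap, uncapped)


for minutes in (0.5, 5, 60, 480):
    print(
        minutes,
        round(refund_percent(minutes, "enterprise"), 1),
        round(refund_percent(minutes, "premium"), 1),
    )
```

The results track the table closely; small differences (for example, about 13.9% versus the listed 13.8% for one hour on Enterprise) suggest that the published figures were rounded or truncated.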

As you can see, with our Premium SLA, if our service is down for 45 minutes, we refund you 100% of your monthly bill. It doesn’t get much simpler than that.

SLAs are more than just insurance

Most people don’t see an SLA as much more than a form of SaaS insurance. At Algolia, we see it as something much greater: a way to remind our customers of our reliability. We back our Premium SLA with our reinforced infrastructure, and our goal is to provide the best service in the market. We don’t want downtime any more than you do, and we put our money where our mouth is: the SLA gives us an incentive to do everything possible to keep the probability of an outage as close to zero as possible!

It has been a year since we introduced our three-provider setup, and, with it, we’ve been able to allay the worries of even the most cautious of customers.

Our setup has been extensively tested – with outages of entire datacenters and networks – and we’ve still been able to maintain 100% uptime.

To the best of our knowledge, our Premium SLA is unique in the market in terms of simplicity, transparency & refund guarantee. We’d love to tell you more about it if you have any questions or would like to see how your current SLA stands up against ours!