The ultimate guide to SLOs for SaaS support

As a SaaS customer support or success leader, are you managing support SLAs? Are you finding it hard to maintain well-managed internal SLOs across support and R&D teams (product and engineering)? Then THIS is for you!

Software products have to meet support Service Level Agreements (SLAs), such as first response and periodic update times, to ensure timely customer outcomes.

But managing those SLAs becomes harder due to interdependencies across support and R&D teams. Missed SLAs then turn into customer escalations and, potentially, churn.

To keep SLAs running smoothly across organizations, we need internal SLOs (Service Level Objectives), i.e. internal agreements between the teams on the critical path: support and R&D.

For the rest of this post we will use the terms SLOs and internal SLOs interchangeably. In this guide we will cover: why internal SLOs are needed, an SLO definition framework, enforcement, reporting, an implementation example, and key takeaways.

Why internal SLO?

But why can’t I just apply the same SLAs across departments? Glad you asked 😬 It’s because SLAs are stricter, customer-facing commitments, and we need separate internal agreements that are loosely tied to them. Hence internal SLOs were born. Ta-da!

An internal SLO (also called an OLA, or operational-level agreement) is defined, per Wikipedia, as an agreement that describes the responsibilities of each internal group toward support, including the process and timeframe for delivery.

But, an incredible 76% of SaaS support teams find it hard to manage SLOs between support and product teams, as per our survey below.

SLO problem survey

This raises the question: why is it so hard? Well, it is hard to enforce and automate SLOs, especially when support and R&D teams use different tools. This clearly resonated with the support community, as per the survey below:

SLO root cause

Summarizing the above results, it is clear that it is hard because of 3 reasons:

  1. Defining consistent and coherent SLOs is challenging.  
  2. Enforcing those SLOs across tools is hard (including manual pinging and escalations).
  3. Reporting on SLOs is hard when support and product teams are in silos.

Now that we know why it is a problem, how do we go about solving it? Well, the answer lies in two things:

  1. Defining practical SLOs based on your (customer facing) SLAs
  2. Coming to an agreement with your R&D counterparts.

What’s a good SLO?

Coming up with great SLOs is both an art and a science, especially as it involves agreement across support and R&D teams. The solution is a 3-step framework: Defining, Enforcing, and Reporting.

Step 1: Defining SLO rules

The SLO rule definition consists of 3 entities: Condition, Trigger, and Action (CTA).

SLO definition framework

Condition: This determines which issues an SLO rule applies to. It has 3 aspects: customer, support ticket, and engineering issue.

  1. Customer: This includes customer revenue, support tier, renewal, sentiment etc.
  2. Support Ticket: This includes priority, impact, type etc, from support tools like Zendesk, Freshdesk etc.
  3. Engineering Issue: This includes the linked engineering tasks in tools like Jira, Asana etc, with attributes such as type (bug, improvement, feature request, task, ...), priority, severity etc.

Trigger: This is the time threshold that, once crossed by an issue matching the condition, puts it out of SLO. It has 3 aspects.

  1. Time Without First Response: Time elapsed since the ticket was opened without a first response from the product team.
  2. Time Since Update: Time since the product team last updated the issue in their tools.
  3. Time Since Open: Time the ticket has been open without a resolution while the linked engineering issue is also still open.

Action: This is the action to take when the condition and trigger are met. It also has 3 aspects.

  1. Email: Email alerts to specific email groups, individual people, etc.
  2. Ping: Customized messages to a group of people, specific persons, or escalations among the management chain, if needed.
  3. Label: Field updates in tools, so that out-of-SLO issues are highlighted in the tools teams already use and can be filtered in dashboards during team meetings.
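
As a rough illustration, the Condition, Trigger, and Action pieces could be modeled like this. This is only a sketch: the `Issue` fields, the `SloRule` class, the helper names, and the 8-hour threshold are all hypothetical, and real data would come from your support and engineering tools.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Optional


@dataclass
class Issue:
    """A support ticket joined with its linked engineering issue (hypothetical shape)."""
    ticket_id: str
    support_tier: str       # customer aspect, e.g. "enterprise"
    ticket_priority: str    # support ticket aspect, e.g. "high"
    eng_issue_type: str     # engineering issue aspect, e.g. "bug"
    opened_at: datetime
    first_eng_response_at: Optional[datetime] = None
    last_eng_update_at: Optional[datetime] = None
    ticket_resolved: bool = False
    eng_issue_open: bool = True


def time_without_first_response(issue: Issue, now: datetime) -> timedelta:
    # Zero once the product team has responded; otherwise time since open.
    if issue.first_eng_response_at is not None:
        return timedelta(0)
    return now - issue.opened_at


def time_since_update(issue: Issue, now: datetime) -> timedelta:
    # Falls back to the open time if the product team never updated the issue.
    return now - (issue.last_eng_update_at or issue.opened_at)


def time_since_open(issue: Issue, now: datetime) -> timedelta:
    # Only counts while the ticket is unresolved and the eng issue is still open.
    if issue.ticket_resolved or not issue.eng_issue_open:
        return timedelta(0)
    return now - issue.opened_at


@dataclass
class SloRule:
    name: str
    condition: Callable[[Issue], bool]                # which issues the rule covers
    trigger: Callable[[Issue, datetime], timedelta]   # which timer to measure
    threshold: timedelta                              # how long is too long
    action: str                                       # "email", "ping", or "label"

    def is_out_of_slo(self, issue: Issue, now: datetime) -> bool:
        return self.condition(issue) and self.trigger(issue, now) > self.threshold


# Example rule with made-up values: enterprise bugs need a first response in 8 hours.
rule = SloRule(
    name="enterprise-bug-first-response",
    condition=lambda i: i.support_tier == "enterprise" and i.eng_issue_type == "bug",
    trigger=time_without_first_response,
    threshold=timedelta(hours=8),
    action="ping",
)
now = datetime(2024, 1, 2, 12, 0)
issue = Issue("T-1", "enterprise", "high", "bug", opened_at=datetime(2024, 1, 2, 1, 0))
print(rule.is_out_of_slo(issue, now))
```

Keeping the condition and the trigger as separate functions means the same timer can be reused across rules with different thresholds per tier or priority.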

While crafting these SLO definitions, we need to ensure the rules support SLAs such as time to resolution, first response time, etc. The example below gives a rough idea of how to use this framework.

Example

SLO example

So how do you define the severity of incoming support tickets? Severity on a support case relies mainly on the following:

  1. Type of issue (cosmetic, outage, minor bug, etc)
  2. Customer impact (users affected, revenue lost etc).
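
To make the two inputs concrete, here is a minimal sketch of a severity function. The function name, the severity levels, and the impact thresholds (100 users, 10,000 in revenue) are assumptions for illustration only; you would agree on your own values with R&D.

```python
def classify_severity(issue_type: str, users_affected: int, revenue_at_risk: float) -> str:
    """Return "high", "medium", or "low" from issue type and customer impact."""
    # Hypothetical thresholds; tune these to your own customer base.
    high_impact = users_affected >= 100 or revenue_at_risk >= 10_000

    if issue_type == "outage":
        return "high"  # outages are always treated as high severity here
    if issue_type == "cosmetic":
        return "medium" if high_impact else "low"
    # Everything else (e.g. a minor bug) is driven by customer impact.
    return "high" if high_impact else "medium"


print(classify_severity("minor bug", users_affected=500, revenue_at_risk=0))
```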

The above examples might not be enough, so we’ll be sharing more detailed templates on severity, conditions, and triggers soon. Stay tuned :)

Now onto the next step of enforcing these SLOs.

Step 2: Enforcing SLOs

The action part of the SLO rule definition should help with enforcement. Let’s take a deeper dive into 3 ways of accomplishing it.

Digests

This is a list of issues to send across support and R&D. Agree on a cadence and email groups across support and R&D teams. We’d recommend separate lists by urgency: critical, important, and non-critical.

  • Critical: Include high severity issues that are impacting customer SLAs and those that will be out of SLO within 24 hours.
  • Important: Include medium severity issues that will be out of SLO within 24 hours.
  • Non-critical: Include issues that will be out of SLO in a few days.

More guidance on reporting is included in Step 3.
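
The digest lists above could be assigned with a small function like the one below. It is a sketch under assumptions: the function name is made up, "a few days" is taken to mean 3 days, and the critical bucket is read as "high severity and either impacting a customer SLA or due within 24 hours".

```python
from datetime import timedelta
from typing import Optional

FEW_DAYS = 3  # assumption: what "a few days" means for the non-critical digest


def digest_bucket(severity: str, time_to_breach: timedelta,
                  impacts_customer_sla: bool = False) -> Optional[str]:
    """Pick the digest an issue belongs in, or None if it is not due soon."""
    within_24h = time_to_breach <= timedelta(hours=24)
    if severity == "high" and (impacts_customer_sla or within_24h):
        return "critical"
    if severity == "medium" and within_24h:
        return "important"
    if time_to_breach <= timedelta(days=FEW_DAYS):
        return "non-critical"
    return None


print(digest_bucket("medium", timedelta(hours=12)))
```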

Pings

Direct messaging is the best way to take action before it’s too late. Agree on an escalation process with R&D counterparts. This should include:

  • Direct messages to the respective engineers or product managers if no action has been taken for X days.
  • Messages to managers of engineers and product managers after Y days of no response.
  • Messages to director / VP level for critical issues with no action for Z days.
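
The escalation ladder above can be written down as data, so that support and R&D agree on one place to change X, Y, and Z. The values below (2, 4, and 7 days) and the target names are placeholders, not recommendations.

```python
from typing import Optional

# (minimum days without action, who to message, critical issues only?)
# Ordered from the highest rung down. The day counts are placeholder values.
ESCALATION_LADDER = [
    (7, "director_or_vp", True),   # Z days, critical issues only
    (4, "manager", False),         # Y days
    (2, "owner", False),           # X days: the engineer or product manager
]


def escalation_target(days_without_action: int, is_critical: bool) -> Optional[str]:
    """Return the highest rung that applies, or None if no ping is due yet."""
    for min_days, target, critical_only in ESCALATION_LADDER:
        if days_without_action >= min_days and (is_critical or not critical_only):
            return target
    return None


print(escalation_target(5, is_critical=False))
```

Non-critical issues never go past the manager rung in this sketch, which matches the idea of reserving director or VP pings for critical issues.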

Live meetings

These are recurring meetings to keep support and R&D teams in sync on customer issues.

  • Weekly meetings: Agree on a weekly cadence with product and engineering teams to go through both critical and non-critical issues.
  • Critical ad hoc meetings: Reserve these for SLO breaches that involve significant customer impact and have no clear solution available from R&D teams.

More on implementing these enforcement mechanisms in the implementation section.

Step 3: Reporting on SLOs

Regular reports on SLOs ensure that cross-functional teams and leadership are in the loop on ongoing issues. Here are 3 levels of reporting:

  • Team level: These reports should be sent out to the people involved with the issues, i.e. support, engineering, and product managers. They should include issues that are out of SLO and soon to be out of SLO, broken down by support tier and priority.
  • Customer level: List of issues and their status for the top N customers, so that CS leadership can review them before going into customer meetings.
  • Organization level: Top N critical issues and trends, which can be sent to leadership. This should include the age of the issue, type of SLO breach, and impact to SLA, sorted by next SLO breach. It’s important to summarize information into risks (i.e. need help from leadership), concerns (i.e. have a solution, but will miss an SLA/SLO), and mentions (future SLOs/SLAs at risk).
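
As a sketch of the team-level breakdown and the organization-level ordering described above, assuming each issue is a plain dict with a hypothetical `slo_due` timestamp, `support_tier`, and `priority`:

```python
from collections import Counter
from datetime import datetime, timedelta


def team_level_breakdown(issues, now, soon=timedelta(hours=24)):
    """Count out-of-SLO and soon-to-be-out-of-SLO issues by tier and priority."""
    counts = Counter()
    for issue in issues:
        time_left = issue["slo_due"] - now
        if time_left <= timedelta(0):
            status = "out_of_slo"
        elif time_left <= soon:
            status = "soon_out_of_slo"
        else:
            continue  # comfortably within SLO, so left out of the report
        counts[(status, issue["support_tier"], issue["priority"])] += 1
    return counts


def org_level_top_n(issues, n):
    """Top N issues ordered by their SLO due time, earliest first."""
    return sorted(issues, key=lambda issue: issue["slo_due"])[:n]


# Made-up sample data.
now = datetime(2024, 1, 10, 9, 0)
issues = [
    {"id": "T-3", "support_tier": "standard", "priority": "low",
     "slo_due": datetime(2024, 1, 20, 9, 0)},
    {"id": "T-2", "support_tier": "enterprise", "priority": "high",
     "slo_due": datetime(2024, 1, 10, 18, 0)},
    {"id": "T-1", "support_tier": "enterprise", "priority": "high",
     "slo_due": datetime(2024, 1, 9, 9, 0)},
]
print(team_level_breakdown(issues, now))
print([issue["id"] for issue in org_level_top_n(issues, 2)])
```

Sorting by due time puts already-breached issues first, which is one reasonable reading of "sorted by next SLO breach".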

We’ll share a reporting template for you soon :) Now onto implementing this framework.

How to implement SLOs?

OK, frameworks are fine, but how do we go about implementing them in our existing tools? There are so many support and R&D tools out there, so I’ll provide a high-level example with two popular tools: Zendesk, used by support teams, and Jira, used by R&D teams.

We haven’t found a perfect solution (leave your ideas/suggestions in the comments), but here are the high-level steps, should you embark on this journey:

Step 1: Zendesk Jira integration

  • Install Zendesk native app in Jira marketplace
  • Install Jira app built by Zendesk within Zendesk marketplace
  • Create custom fields in Zendesk that map to Jira fields (e.g. an eng priority field that maps to priority in Jira, so that you know whenever there is a difference between support and eng priorities).
Zendesk Jira integration

Step 2: Zendesk configuration

  • SLA configuration: Go to objects and rules → Service level agreements.
  • Then configure conditions based on Zendesk specific fields (type, support tier etc) and custom Jira fields in Zendesk (engineering priority, etc).
  • Then configure triggers like Time Without First Response using first reply time field within the target section. Example below:
Zendesk SLA configuration
  • Action automation: Email automation could be achieved within objects and rules → automation. Example below:
Zendesk SLO automation
Zendesk email triggers

Step 3: Jira configuration

  • In addition to reports in Zendesk, it’s beneficial to implement reporting in Jira so that R&D teams are in the loop. You can do this in Jira by selecting Filters → Advanced issue search, then using JQL (Jira Query Language) to filter by the “jira-escalated” tag, days since opened, priority etc.
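
For example, a JQL filter along these lines could be generated with a small helper. This is a sketch: the helper name and default values are made up, and it assumes the “jira-escalated” tag shows up as a Jira label and that your priority names are High and Highest. Check both against your own Jira setup.

```python
def build_escalation_jql(label="jira-escalated", min_age_days=7,
                         priorities=("High", "Highest")):
    """Build a JQL string for open, escalated issues older than min_age_days."""
    priority_list = ", ".join(f'"{p}"' for p in priorities)
    return (
        f'labels = "{label}" '
        f"AND created <= -{min_age_days}d "      # opened at least N days ago
        f"AND priority in ({priority_list}) "
        "AND resolution = Unresolved "           # still open
        "ORDER BY created ASC"                   # oldest first
    )


print(build_escalation_jql())
```

The resulting string can be pasted into the advanced issue search box and saved as a filter for dashboards or weekly meetings.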

In addition, if your teams work in Slack, you’d also need to install Slack integrations.

Now, you might be asking: But Zendesk only monitors updates on the support side, how do I monitor Jira updates? How do I automate pings? How do I build reporting across support and R&D involving both SLAs and internal SLOs? How do I automate follow-ups based on our escalation process?

We hear you. Unfortunately, we didn’t find a seamless way to implement and automate SLOs across multiple tools. There are workarounds using Zapier integrations across Zendesk and Jira, but even those become a Frankenstein solution that doesn’t scale well.

At Rejoy, we are working on something new to make this truly seamless so that you can manage SLOs/SLAs efficiently and effectively. Sign up here to stay tuned for our pre-launch updates. It's going to be groundbreaking! :)

Conclusion

Brownie points to you for making it this far. Here are the 3 key takeaways:

  1. Define SLOs across support and R&D: Follow the 3-part SLO framework of defining, enforcing, and reporting. Specifically, remember the condition, trigger, and action parts of each rule definition.
  2. Agree on enforcement policy: Defining an SLO without an enforcement policy is moot, as shown in our surveys. Get an agreement on the escalation and reporting process across support and R&D teams.
  3. Automate across tools: It’s impossible to manage customer SLAs and internal SLOs without an automated system that works across your support and R&D tools. There is no seamless solution today that scales across your teams and tools, but there are a few workarounds in the interim. Stay tuned for our pre-launch, as we work to automate the heck out of this process, so that you never miss an SLA due to SLO misses!

Ok, now, how do you manage your internal SLOs? What automations do you use? Feel free to provide feedback, I’d love to hear from you! In my next article, I’ll cover methods and frameworks to influence your R&D teams.

Special Thanks: Kat and GT for providing feedback on an early version of this post 🙏

