Supporting services out of hours
Many of our services are used outside normal office hours (9am-6pm, Monday-Friday). To make sure people are able to keep using them with confidence, even when problems arise, it’s sometimes necessary to have people on call out of hours to help with these problems.
Some services don’t need out of hours support. This might be because their users don’t rely on them for critical parts of their job, because the users themselves don’t need them outside normal office hours, or because the services are robust enough to recover from some problems automatically.
There are also cases where services may need out of hours support, but we’re not currently able to provide it without risking the health and well-being of our staff, usually because we don’t have enough staff who are able to support it. In these cases, we should either engage a wider group of staff to support a set of services, or seek the help of suppliers to provide sustainable support.
Determining if your service needs support out of hours
Not all services need out of hours support. When considering the need for out of hours support for your service, you need to understand what issues might arise if the service were to go down when no one is around to fix it, and how likely it is that the service might go down.
The kinds of issues that might suggest out of hours support is necessary include things like:
- Danger to someone’s life
- Impact on the MOJ’s ability to provide critical, time-sensitive justice services
- Impact on the ability of staff and suppliers to carry out their jobs safely and to the best of their ability
- Impact on the ability of citizens, suppliers, and partners to interact with MOJ services in a way that meets their needs
These and similar issues should be considered alongside your service’s robustness.
For example, if your service can recover from most classes of issue itself, and downtime affects only a small group of people whose task is not time-sensitive, out of hours support may not be necessary. If, however, a problem with your service (whether outages, security breaches, or something else) out of hours presents a genuine risk to someone’s life, it may be appropriate to provide out of hours support around the clock, depending on the robustness of the system.
Assessing your team’s ability to provide out of hours support
To run a sustainable out of hours rota, you must first ensure that either:
- You have enough staff (including contractors) to support the service out of hours sustainably (realistically this means at least six people, ideally eight), and have the ability to pay them for their time supporting the service, or
- You have budget to pay a third party to provide your support sustainably, and have the ability to train them sufficiently before they are expected to start supporting it
If you are unable to do either of these, you cannot support your service out of hours without risking people’s health and well-being. In that case, escalate the risk and do not try to provide out of hours support until you can meet one of these conditions.
It may be appropriate for a few related teams (for example, in a single service area) to share a common rota. This can allow teams that would otherwise be unable to run a rota sustainably to do so with some of their colleagues. To do this, though, you must ensure that everyone on the rota is sufficiently trained to support all of the services their rota is responsible for.
Managing a sustainable out of hours rota
Being on call can be stressful, and can impact people’s mental health and personal lives if not managed well. If that happens, people will withdraw from the rota, leaving the service unsupported.
To avoid that, a few core criteria must be met:
- No one person is on call 50% of the time or more (one week in every two), and ideally no more than 30% of the time (around one week in every three). This, along with the next criterion, means that you will need at least six people, and ideally eight, to run a rota sustainably.
- At least two people are on call at any one time, and know which of them will be contacted first (primary on-call) and second (secondary on-call).
- If your service regularly has problems out of hours, or problems typically take multiple hours to resolve, you will need to take that into account when designing your rota. These sorts of situations will require more people, shorter shifts, or more layers of support. Remember that improving service robustness is a great way of simplifying and reducing the costs of providing out of hours support.
- If your service regularly has problems out of hours, your incident retrospectives should ensure that appropriate learning and investment is happening to reduce how often the service has problems.
- The rota repeats regularly, so people know when they are on call over the coming weeks and months, and can plan accordingly. Ideally, the on-call rota should be planned at least three months ahead. While this might seem ambitious, you really want to give your support people as much time as possible to plan for potential anti-social hours of availability.
- People on call are appropriately paid for being available on call and for responding to incidents.
- People on call know the hours during which they are expected to be available, whether that’s 24/7, 7am-10pm, or some other agreed hours.
- People on call can trade shifts with one another when they need to.
- People on call for the first time are paired with someone who is confident in supporting the systems and has been on call for them for some time, ideally someone they know well. Another way to make going on call less intimidating is to have an in-hours support rota that everyone must have served on before they go on call.
- People on call understand the circumstances under which they will be paged. This should be a documented list of alerts with agreed steps to resolve.
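The staffing figures in the criteria above follow from simple arithmetic, which the sketch below illustrates in Python. It is only an illustration under stated assumptions: the function names and the example people are hypothetical, shifts are assumed to be of equal length, and the rota is assumed to be shared evenly. It is not part of any MOJ tooling.

```python
from collections import Counter
from fractions import Fraction
from math import ceil

# All names here are hypothetical and for illustration only.


def min_rota_size(on_call_at_once: int, max_share: Fraction) -> int:
    """Smallest team size at which an evenly shared rota keeps
    everyone at or below max_share of the shifts."""
    return ceil(on_call_at_once / max_share)


def on_call_shares(rota):
    """Map each person to the fraction of shifts in which they appear.

    `rota` is a list of shifts; each shift is a (primary, secondary) pair.
    """
    counts = Counter(person for shift in rota for person in shift)
    return {person: Fraction(n, len(rota)) for person, n in counts.items()}


def overloaded(rota, limit=Fraction(1, 2)):
    """People who are on call for `limit` of the shifts or more."""
    shares = on_call_shares(rota)
    return sorted(p for p, share in shares.items() if share >= limit)


people = ["Asha", "Ben", "Caz", "Dee", "Eli", "Fay"]
# Week i: primary is people[i], secondary is the next person round the list.
rota = [(people[i % 6], people[(i + 1) % 6]) for i in range(6)]
```

With two people on call at once and each person on call one week in three, `min_rota_size(2, Fraction(1, 3))` gives six people; at one week in four it gives eight. In the six-person example rota each person appears in a third of the shifts, so `overloaded(rota)` is empty.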
Contacting people on call about an issue
To allow people to manage how they are contacted when on call, the MOJ uses PagerDuty. PagerDuty (and systems like it) allows people on call to specify how they prefer to be contacted (in-app notifications, phone calls, text messages) and when, relative to an incident being raised. It also ensures that the people on call are paged according to pre-defined escalation policies (primary first, then secondary, then repeat). To get access to PagerDuty or a similar tool, contact the operations-engineering team on their Slack channel, #ask-operations-engineering.
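The escalation order described above (primary first, then secondary, then repeat) can be sketched in a few lines of Python. This is a hypothetical illustration of the ordering only. It does not use or represent PagerDuty’s API, and the function name and the cap on the number of pages are assumptions made for the example.

```python
from itertools import cycle, islice


def escalation_order(responders, max_pages):
    """Return who would be paged, in order, until max_pages is reached.

    `responders` is ordered by priority, for example
    ["primary", "secondary"]. The list is cycled so that paging
    returns to the primary if nobody acknowledges the incident.
    """
    return list(islice(cycle(responders), max_pages))


# Hypothetical example: five unacknowledged pages in a row.
pages = escalation_order(["primary", "secondary"], 5)
```

In a real tool the wait between pages, and the point at which escalation stops, would be set in the escalation policy rather than in code like this.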
People on call should never be contacted directly to report issues, especially not through their personal phone number. Contacting individuals directly creates a dependency on them, making them implicitly on call 100% of the time. This is not good for individuals’ mental health, and can result in them withdrawing entirely from support or even leaving the organisation. If that happens, we lose their knowledge and skill at supporting the system entirely, when we could instead have let others learn from them and let them work at a sustainable level.
This also applies to email and chat (for example, Slack or Skype) messages. Do not assume that someone will respond when you message them, as they may not have notifications turned on for that service. Further, you should not share others’ personal contact details without their permission, even to help people get in touch with them out of hours.